COR Brief

Terminal Bench 2.0

Terminal-Bench 2.0 is an updated benchmark and evaluation harness for assessing AI agents on terminal-based tasks. It provides a dataset of roughly 89 tasks covering real-world software engineering challenges such as compiling code, training models, setting up servers, and fixing vulnerabilities. Each task includes English instructions, test scripts for verification, and a reference ("oracle") solution, and runs inside a containerized environment built from a Docker image.

The update from version 1.0 addresses earlier problems with task reliability: tasks have been verified both manually and with language-model assistance to improve quality. The harness connects an AI agent or language model to a terminal sandbox and measures its success rate on tasks that test terminal mastery. It ships with a command-line interface for running evaluations and supports custom Docker configurations. Terminal-Bench 2.0 also features a public leaderboard for tracking agent performance and an adapter system for adding custom tasks. The project is open source, actively maintained, and supported by a community of contributors and users.

Updated Jan 20, 2026 · open-source

Terminal-Bench 2.0 is an open-source benchmark for evaluating AI agents on terminal-based software engineering tasks using containerized environments.

Pricing: open-source
Category: Agents & Automation
Key Features

01. Offers around 89 tasks with instructions, test scripts, and oracle solutions across various difficulty levels and categories.
02. Runs tasks in containerized environments using pre-built Docker images, with support for local builds and custom configurations.
03. Provides a CLI (`tb`, with the `tb run` subcommand) for evaluating AI agents on the benchmark tasks.
04. Tracks and displays success rates of agents and models on Terminal-Bench 2.0 tasks for performance comparison.
05. Allows users to add custom tasks, modify prompts, and integrate new metrics, with pull-request support for the task registry.

Use Cases

AI Agent Performance Evaluation

Developers and researchers can benchmark AI agents on terminal-based tasks such as code compilation, server setup, and vulnerability fixing.

Custom Task Integration

Users can extend the benchmark by adding new tasks or modifying existing ones to suit specific evaluation needs.

How to Use

1. Install Terminal-Bench
Run `uv tool install terminal-bench` or `pip install terminal-bench` to install the package.
2. Run Evaluations
Use the `tb` CLI (for example, `tb run`) to execute benchmark tasks and evaluate AI agents.
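A minimal sketch of this step in shell. Only the `--help` probe is shown, because concrete `tb run` flags vary by version and should be checked against your installed release; the fallback message is illustrative:

```shell
#!/bin/sh
# Hedged sketch: invoke the Terminal-Bench CLI if it is on PATH.
# `tb` is the entry point named above; the else-branch text is ours.
if command -v tb >/dev/null 2>&1; then
    # Inspect available subcommands and flags before a full evaluation run.
    tb --help
else
    echo "tb not found; install with: pip install terminal-bench"
fi
```

Probing with `--help` first avoids guessing at flags that may differ between releases of the benchmark.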
3. Configure Custom Docker Images
Set `use_prebuilt_image=false` in CLI commands or Python evaluation scripts to use custom Docker images.
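Custom images follow ordinary Docker conventions. The sketch below writes a minimal Dockerfile for a hypothetical task environment; the base image, packages, and directory names are assumptions for illustration, not the project's actual task layout:

```shell
#!/bin/sh
# Illustrative: scaffold a Dockerfile for a custom task environment.
# The base image and package list here are assumptions.
mkdir -p my-task-env
cat > my-task-env/Dockerfile <<'EOF'
FROM ubuntu:24.04
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential git python3 \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
EOF

# Build locally, then run the benchmark with the pre-built image disabled:
#   docker build -t my-task-env my-task-env
cat my-task-env/Dockerfile
```

After building, the harness can be pointed at the local image by setting `use_prebuilt_image=false` as described above.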
4
View Leaderboard
Access the public leaderboard at https://www.tbench.ai/leaderboard/terminal-bench/2.0 to compare agent performance.
5. Contribute Tasks or Adapters
Follow the documentation to add new tasks or adapters by placing files in the `tasks` folder and submitting a pull request.
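The contribution step can be sketched as a scaffold. The file names below (`instruction.md`, `run-tests.sh`, `solution.sh`) are hypothetical stand-ins for the three pieces each task needs (instructions, test script, oracle solution); check the project documentation for the actual task schema before opening a pull request:

```shell
#!/bin/sh
# Illustrative scaffold for a new task inside the tasks folder.
# File names are assumptions; consult the Terminal-Bench docs for
# the real schema the harness expects.
TASK=tasks/my-new-task
mkdir -p "$TASK"

# English instructions the agent will receive.
cat > "$TASK/instruction.md" <<'EOF'
Compile the program in /app and make the test suite pass.
EOF

# Verification script the harness runs after the agent finishes.
cat > "$TASK/run-tests.sh" <<'EOF'
#!/bin/sh
test -x /app/program
EOF

# Oracle (reference) solution used to validate the task itself.
cat > "$TASK/solution.sh" <<'EOF'
#!/bin/sh
cc -o /app/program /app/main.c
EOF

ls "$TASK"
```

Keeping the instruction, test, and oracle solution together in one folder mirrors the per-task structure described in the overview above.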
Pricing
Model: open-source

Terminal-Bench 2.0 is distributed as an open-source pip package; no pricing information is published.

Assessment
Strengths
  • High-quality tasks verified manually and with language model assistance to ensure reliability.
  • Widely adopted as a standard benchmark by frontier AI labs since the initial release.
  • Flexible Docker and container support including cloud deployment and local builds.
  • Active community with approximately 1,000 Discord members and 100 GitHub contributors.
  • Public leaderboard enables transparent comparison of agent performance.
Limitations
  • Some tasks remain brittle due to external dependencies, such as YouTube anti-bot measures affecting prior solutions.
  • Differences in evaluation frameworks (e.g., Inspect AI ReAct vs. Harbor) can affect result consistency.
  • Currently in beta stage with planned expansions and no formal published releases.