Terminal-Bench 2.0
Terminal-Bench 2.0 is an updated benchmark and evaluation harness for assessing AI agents' performance on terminal-based tasks. It provides a dataset of roughly 89 tasks covering real-world software engineering challenges such as compiling code, training models, setting up servers, and fixing security vulnerabilities. Each task includes English instructions, test scripts for verification, and a reference solution, all executed in containerized environments built from Docker images. The update from version 1.0 addresses earlier problems with task reliability through manual and language-model-assisted verification.

The benchmark connects an AI agent or language model to a terminal sandbox and measures its success rate on tasks that test terminal mastery. It ships a command-line interface for running evaluations and supports custom Docker configurations. Terminal-Bench 2.0 also features a public leaderboard that tracks agent performance and an adapter system for adding custom tasks. The project is open source, actively maintained, and supported by a community of contributors and users.
Terminal-Bench 2.0 is an open-source benchmark for evaluating AI agents on terminal-based software engineering tasks using containerized environments.
AI Agent Performance Evaluation
Developers and researchers can benchmark AI agents on terminal-based tasks such as code compilation, server setup, and vulnerability fixing.
Custom Task Integration
Users can extend the benchmark by adding new tasks or modifying existing ones to suit specific evaluation needs.
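Since the description names the components every task carries (an English instruction, verification test scripts, a reference solution, and a Docker image), a custom task plausibly bundles those pieces in one directory. The layout and file names below are assumptions for illustration, not the harness's confirmed schema; check the project's task documentation for the actual format.

```
my-task/                 # hypothetical custom task directory
├── task.yaml            # assumed: English instruction and task metadata
├── Dockerfile           # assumed: container environment the agent works in
├── solution.sh          # assumed: reference solution used to validate the task
└── run-tests.sh         # assumed: test script that verifies the final state
```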
Run uv tool install terminal-bench or pip install terminal-bench to install the package.
Use tb or tb run to execute benchmark tasks and evaluate AI agents.
Set use_prebuilt_image=false in CLI commands or Python evaluation scripts to use custom Docker images.
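A typical install-and-run session might look like the sketch below. Only the install commands, the tb / tb run entry points, and the use_prebuilt_image=false setting come from the text above; the --dataset and --agent flag values are illustrative assumptions, so consult tb run --help for the options your installed version actually supports.

```shell
# Install the CLI with either installer (both come from the text above):
uv tool install terminal-bench
# or: pip install terminal-bench

# Run an evaluation. The --dataset and --agent values are illustrative
# assumptions, not confirmed defaults; see `tb run --help` for real options.
tb run --dataset terminal-bench-core --agent my-agent

# Build task environments locally instead of pulling prebuilt images,
# using the setting named in the text above:
tb run --dataset terminal-bench-core --agent my-agent use_prebuilt_image=false
```

Because each run launches Docker containers for every task, a working Docker installation is a practical prerequisite for the commands above.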