Key Features

What you can do

Comprehensive Task Dataset

Provides roughly 89 tasks, each with an instruction, a test script, and an oracle solution, spanning a range of difficulty levels and categories.

Docker-Based Execution Harness

Runs tasks in containerized environments using pre-built Docker images, with support for local builds and custom configurations.

Command-Line Interface Tool

Provides the `tb` command-line tool (for example, `tb run`) for evaluating AI agents on the benchmark tasks.

Public Leaderboard

Tracks and displays success rates of agents and models on Terminal-Bench 2.0 tasks for performance comparison.
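
As a rough illustration of how a success rate could be computed, here is a minimal sketch. It assumes each result is a hypothetical `(agent, task_id, passed)` tuple and that the success rate is the fraction of attempted tasks that passed. The function names and the record layout are illustrative assumptions, not the leaderboard's actual implementation.

```python
from collections import defaultdict


def success_rates(results):
    """Return {agent: fraction of attempted tasks that passed}.

    `results` is an iterable of (agent, task_id, passed) tuples.
    This record layout is a hypothetical one chosen for illustration.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for agent, _task_id, ok in results:
        total[agent] += 1
        if ok:
            passed[agent] += 1
    return {agent: passed[agent] / total[agent] for agent in total}


def rank(rates):
    """Sort (agent, rate) pairs from highest to lowest success rate."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank(success_rates(results))` on a list of such tuples yields the agents ordered for side-by-side comparison.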

Adapter System for Custom Tasks

Lets users add custom tasks, modify prompts, and integrate new metrics; contributions can be submitted to the task registry via pull request.