Comprehensive Task Dataset
Offers around 89 tasks, each with an instruction, a test script, and an oracle solution, spanning multiple difficulty levels and categories.
Docker-Based Execution Harness
Runs tasks in containerized environments using pre-built Docker images, with support for local builds and custom configurations.
Command-Line Interface Tool
Provides CLI commands (`tb` or `tb run`) for evaluating AI agents on the benchmark tasks.
Public Leaderboard
Tracks and displays success rates of agents and models on Terminal-Bench 2.0 tasks for performance comparison.
Adapter System for Custom Tasks
Allows users to add custom tasks, modify prompts, and integrate new metrics, and supports contributing them to the task registry via pull requests.
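
As a rough illustration of the success-rate metric that the leaderboard reports, the sketch below scores a set of per-task outcomes. It is not Terminal-Bench's own code: the `TaskResult` record, the `success_rate` helper, and the task names are hypothetical, and it assumes a task counts as solved only when its test script passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one agent attempt on one task (hypothetical record shape)."""
    task_id: str
    passed: bool  # assumed: True when the task's test script succeeded


def success_rate(results: list[TaskResult]) -> float:
    """Return the fraction of tasks whose tests passed, in [0.0, 1.0]."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.passed for r in results) / len(results)


# Illustrative data only; these task names are made up.
results = [
    TaskResult("example-task-a", True),
    TaskResult("example-task-b", False),
    TaskResult("example-task-c", True),
    TaskResult("example-task-d", True),
]
print(f"{success_rate(results):.0%}")  # 3 of 4 passed, prints 75%
```

Comparing agents or models then amounts to computing this fraction over the same task set for each one.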