Key Features - Terminal-Bench 2.0

✨

Offers around 89 tasks, each with an instruction, a test script, and an oracle solution, spanning a range of difficulty levels and categories.

✨

Runs tasks in containerized environments using pre-built Docker images, with support for local builds and custom configurations.

✨

Provides CLI commands (`tb` or `tb run`) for evaluating AI agents on the benchmark tasks.

✨

Tracks and displays the success rates of agents and models on Terminal-Bench 2.0 tasks so their performance can be compared.

✨

Lets users add custom tasks, modify prompts, and integrate new metrics, with contributions to the task registry accepted via pull requests.
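
To make the success-rate feature above concrete, here is a minimal Python sketch of how a per-agent success rate could be computed from pass/fail outcomes. It is an illustration only: `TrialResult` and `success_rates` are hypothetical names, not part of Terminal-Bench or its CLI, and the sketch assumes the rate is the fraction of trials whose tests passed.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class TrialResult(NamedTuple):
    """One agent attempt at one task (hypothetical record, not a Terminal-Bench type)."""
    agent: str
    task_id: str
    passed: bool


def success_rates(results: Iterable[TrialResult]) -> dict[str, float]:
    """Return, for each agent, the fraction of its trials that passed."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r.agent] += 1
        if r.passed:
            passed[r.agent] += 1
    return {agent: passed[agent] / total[agent] for agent in total}


# Made-up outcomes, for illustration only.
example = [
    TrialResult("agent-a", "task-1", True),
    TrialResult("agent-a", "task-2", False),
    TrialResult("agent-b", "task-1", True),
    TrialResult("agent-b", "task-2", True),
]
print(success_rates(example))  # {'agent-a': 0.5, 'agent-b': 1.0}
```

Counting every trial equally is the simplest choice. A real leaderboard might instead average over repeated runs per task or weight tasks by difficulty.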