Quick reference
Key strength: High-quality tasks verified manually and with language model assistance to ensure reliability.
Top feature: Comprehensive Task Dataset
Best for: AI Agent Performance Evaluation
Pricing: Free (open-source)
Quick start: Install Terminal-Bench