AI Agent Performance Evaluation
Developers and researchers can benchmark AI agents on terminal-based tasks such as code compilation, server setup, and vulnerability fixing.
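As a rough illustration of what scoring an agent on terminal tasks can look like, the sketch below runs a verification command for each task and reports the fraction that pass. `TerminalTask`, `passed`, and `pass_rate` are hypothetical names chosen for this example, not part of any particular benchmark's API. The check commands are placeholders that call the Python interpreter so the example runs anywhere. In a real run the agent would act in the terminal before each check; that step is omitted here.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class TerminalTask:
    # Hypothetical task record: a name plus a command whose exit code
    # decides whether the agent's work is accepted.
    name: str
    check_cmd: list[str]


def passed(task: TerminalTask, timeout: float = 30.0) -> bool:
    """Run the task's check command; exit code 0 counts as a pass."""
    try:
        result = subprocess.run(task.check_cmd, capture_output=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable check is treated as a failure.
        return False
    return result.returncode == 0


def pass_rate(tasks: list[TerminalTask]) -> float:
    """Fraction of tasks whose check command succeeds."""
    if not tasks:
        return 0.0
    return sum(passed(t) for t in tasks) / len(tasks)


# Placeholder checks: one that succeeds and one that fails.
tasks = [
    TerminalTask("compile-ok", [sys.executable, "-c", "raise SystemExit(0)"]),
    TerminalTask("server-down", [sys.executable, "-c", "raise SystemExit(1)"]),
]
rate = pass_rate(tasks)
```

Using the exit code as the pass signal keeps the check independent of how the agent reached the final state, which is one common way to grade terminal work.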
Custom Task Integration
Users can extend the benchmark by adding new tasks or modifying existing ones to suit specific evaluation needs.