Use Cases

Real-world applications

AI Agent Performance Evaluation

Developers and researchers can benchmark AI agents on terminal-based tasks such as code compilation, server setup, and vulnerability fixing.
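
As a rough illustration of what benchmarking across these task types could look like, the sketch below aggregates per-task outcomes into a pass rate for each category. The `TaskResult` record, the category names, and the sample data are all hypothetical and are not taken from the benchmark's own schema or results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one agent attempt (hypothetical record, not the benchmark's schema)."""
    task_id: str
    category: str  # assumed labels, e.g. "compilation", "server-setup", "security"
    passed: bool


def pass_rate_by_category(results):
    """Return {category: fraction of attempts in that category that passed}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        totals[result.category] += 1
        passes[result.category] += int(result.passed)
    return {category: passes[category] / totals[category] for category in totals}


# Illustrative data only; these are not real benchmark results.
example_results = [
    TaskResult("build-c-project", "compilation", True),
    TaskResult("build-rust-crate", "compilation", False),
    TaskResult("configure-web-server", "server-setup", True),
    TaskResult("patch-injection-bug", "security", True),
]

if __name__ == "__main__":
    for category, rate in sorted(pass_rate_by_category(example_results).items()):
        print(f"{category}: {rate:.0%}")
```

Grouping by category rather than reporting a single overall score makes it easier to see whether an agent is weak on one kind of terminal task in particular.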

Custom Task Integration

Users can extend the benchmark by adding new tasks or modifying existing ones to suit specific evaluation needs.
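
A minimal sketch of how a custom task might be described and checked, assuming a task consists of an instruction for the agent plus a check command whose exit code signals success. The `Task` class, the `REGISTRY` dictionary, and the `register` and `is_solved` helpers are hypothetical names invented for this example; the benchmark's actual task format may differ.

```python
import os
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A hypothetical task definition: what to ask the agent and how to verify it."""
    task_id: str
    instruction: str
    check_cmd: tuple  # argv of the verification command; exit code 0 means solved


REGISTRY = {}  # hypothetical in-memory task registry, keyed by task_id


def register(task):
    """Add a task to the registry, rejecting duplicate identifiers."""
    if task.task_id in REGISTRY:
        raise ValueError(f"duplicate task id: {task.task_id}")
    REGISTRY[task.task_id] = task
    return task


def is_solved(task, workdir, timeout_s=30.0):
    """Run the task's check command inside workdir and report whether it exited with 0."""
    try:
        completed = subprocess.run(
            list(task.check_cmd),
            cwd=workdir,
            capture_output=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return completed.returncode == 0


# Example custom task: the agent must create a file named done.txt.
marker_task = register(
    Task(
        task_id="create-marker-file",
        instruction="Create an empty file named done.txt in the working directory.",
        check_cmd=(
            sys.executable,
            "-c",
            "import os, sys; sys.exit(0 if os.path.exists('done.txt') else 1)",
        ),
    )
)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print("before:", is_solved(marker_task, workdir))
        # Stand-in for the agent's work: create the file the check looks for.
        open(os.path.join(workdir, "done.txt"), "w").close()
        print("after:", is_solved(marker_task, workdir))
```

Keeping the check as an ordinary command with an exit code means an existing task can be modified by swapping its instruction or its check without touching the runner.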