The Solution - terminal-bench-20

Terminal-Bench 2.0 is an updated benchmark and evaluation harness for assessing how well AI agents perform terminal-based tasks. Its dataset contains approximately 89 tasks drawn from real-world software engineering work, such as compiling code, training models, setting up servers, and fixing vulnerabilities.