- Without Terminal Bench 2.0: Some tasks remain brittle because they depend on external services; for example, YouTube anti-bot measures have broken previously working solutions.
- Without Terminal Bench 2.0: Differences between evaluation frameworks (e.g., Inspect AI ReAct vs. Harbor) can make results inconsistent and hard to compare.
- Without Terminal Bench 2.0: Currently in beta, with expansions planned and no formal release published yet.