- Without SWEBench: Evaluating the full benchmark requires significant compute resources.
- Without SWEBench: Multimodal test evaluation depends on the SWE-bench API, which limits standalone use.
- Without SWEBench: No out-of-the-box support for running individual tests without modifying the harness.