The Problem - s-bench-pro

⚠️ Without S. Bench Pro: The held-out and commercial subsets, which together comprise 1,134 instances, are not publicly accessible.
⚠️ Without S. Bench Pro: Full-scale evaluation requires setting up Modal and Docker and managing credentials.