Strengths
- Collects problems shortly after contests end, reducing the risk of training data contamination.
- Includes 1055 problems across difficulty levels in the latest release.
- Evaluates multiple coding capabilities beyond code generation.
- Provides time-annotated problems for testing model generalization.
- Open-source with a reproducible evaluation toolkit.
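One way to picture how time-annotated problems support a generalization test: keep only the problems released after a model's training cutoff, so the model cannot have seen them during training. The sketch below is illustrative only. The `Problem` record, its field names, the `problems_after_cutoff` helper, and the sample data are all hypothetical and are not taken from the benchmark's toolkit.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; the real benchmark's schema may differ.
    title: str
    release_date: date
    difficulty: str


def problems_after_cutoff(problems, cutoff):
    """Return the problems released strictly after the given cutoff date."""
    return [p for p in problems if p.release_date > cutoff]


# Made-up sample data, for illustration only.
problems = [
    Problem("two-sum-variant", date(2023, 6, 1), "easy"),
    Problem("grid-paths", date(2024, 2, 15), "medium"),
    Problem("tree-queries", date(2024, 9, 3), "hard"),
]

# Assumed training cutoff of the model under evaluation.
cutoff = date(2024, 1, 1)
unseen = problems_after_cutoff(problems, cutoff)
print([p.title for p in unseen])
```

Comparing a model's score on the problems before the cutoff with its score on the problems after it gives a rough signal of contamination: a sharp drop on the later set suggests the earlier problems may have leaked into training data.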
Limitations
- The official repository contained bugs that affected scores by up to 50%; these were fixed through community pull requests.
- Limited to Python solutions and competitive programming problems.
- Relies on external contest platforms for problem sourcing.