Strengths & Limitations

Balanced assessment

Strengths

Collects problems shortly after contests to avoid training data contamination.
Includes 1055 problems across difficulty levels in the latest release.
Evaluates multiple coding capabilities beyond code generation.
Provides time-annotated problems, so a model can be tested on problems released after its training cutoff to measure generalization.
Open-source with a reproducible evaluation toolkit.
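
The time annotations are what make contamination-aware evaluation possible: scoring is restricted to problems released after a model's training cutoff. The following is a minimal sketch of that filtering step, not the benchmark's actual toolkit. The `Problem` class, its field names, the `problems_after_cutoff` helper, and the sample data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


# Hypothetical record type; the real dataset schema may differ.
@dataclass(frozen=True)
class Problem:
    problem_id: str
    release_date: date
    difficulty: str


def problems_after_cutoff(problems, cutoff):
    """Keep only problems released strictly after the training cutoff."""
    return [p for p in problems if p.release_date > cutoff]


# Made-up sample data, for illustration only.
problems = [
    Problem("p1", date(2023, 5, 1), "easy"),
    Problem("p2", date(2023, 11, 15), "medium"),
    Problem("p3", date(2024, 3, 2), "hard"),
]

# Assumed training cutoff of the model under evaluation.
cutoff = date(2023, 9, 30)

held_out = problems_after_cutoff(problems, cutoff)
print([p.problem_id for p in held_out])  # prints ['p2', 'p3']
```

Comparing a model's score on the held-out subset with its score on earlier problems gives a rough signal of whether the earlier results were inflated by contamination.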

Limitations

The official repository contained bugs that affected scores by up to 50%; these were fixed through community pull requests.
Limited to Python solutions and competitive programming problems.
Relies on external contest platforms for problem sourcing.