Strengths & Limitations

Balanced assessment

Strengths

  • Collects problems shortly after contests to limit training-data contamination.
  • Includes 1055 problems across difficulty levels in the latest release.
  • Evaluates multiple coding capabilities beyond code generation.
  • Provides time-annotated problems for testing model generalization.
  • Open-source with a reproducible evaluation toolkit.
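
To illustrate how time annotations support contamination-aware evaluation, here is a minimal sketch that splits problems around a model's training cutoff. The Problem record, its field names, and the split_by_cutoff helper are hypothetical and are not the benchmark's actual schema or toolkit API; the dates are made-up sample values.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; field names are illustrative only.
    problem_id: str
    release_date: date


def split_by_cutoff(problems, training_cutoff):
    """Partition problems into those released on or before the cutoff and those after it."""
    possibly_seen = [p for p in problems if p.release_date <= training_cutoff]
    held_out = [p for p in problems if p.release_date > training_cutoff]
    return possibly_seen, held_out


# Made-up sample data for demonstration.
problems = [
    Problem("a", date(2023, 5, 1)),
    Problem("b", date(2023, 9, 15)),
    Problem("c", date(2024, 2, 3)),
]

possibly_seen, held_out = split_by_cutoff(problems, date(2023, 9, 15))
print([p.problem_id for p in held_out])  # ['c']
```

Scoring a model only on the held-out group gives a rough estimate of performance on problems it could not have seen during training, assuming the stated cutoff is accurate.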

Limitations

  • The official repository had bugs that affected scores by up to 50%; they were fixed through community pull requests.
  • Limited to Python solutions and competitive programming problems.
  • Relies on external contest platforms for problem sourcing.