Key Features

What you can do

Continuous Problem Collection

Automatically gathers new coding problems from live contests on LeetCode, AtCoder, and CodeForces to maintain an up-to-date benchmark.
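
A rough sketch of the kind of record such a collector could maintain; the field names and the merge helper are illustrative, not the toolkit's actual schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Problem:
    platform: str              # "leetcode", "atcoder", or "codeforces"
    problem_id: str
    statement: str
    release_date: date         # contest date, kept for contamination control
    hidden_tests: list = field(default_factory=list)

def merge_new(benchmark: list[Problem], fetched: list[Problem]) -> list[Problem]:
    """Append freshly scraped problems that are not already in the benchmark."""
    seen = {(p.platform, p.problem_id) for p in benchmark}
    return benchmark + [p for p in fetched if (p.platform, p.problem_id) not in seen]
```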

Time-Annotated Problem Sets

Annotates each problem with its release date, so models can be evaluated only on problems published after their training cutoff, supporting contamination-free benchmarking.
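
A minimal sketch of how release dates enable this, assuming problem records with a `release_date` field like the one sketched above:

```python
from datetime import date

def contamination_free_subset(problems, training_cutoff: date):
    """Keep only problems released strictly after the model's training-data cutoff,
    so none of the evaluated problems could have appeared in its training data."""
    return [p for p in problems if p.release_date > training_cutoff]

# e.g., for a model whose training data ends in August 2023:
# eval_set = contamination_free_subset(benchmark, date(2023, 8, 31))
```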

Multiple Evaluation Scenarios

Supports code generation, self-repair, code execution, and test output prediction to comprehensively assess LLM coding capabilities.
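
One way to picture the four scenarios; the task descriptions and prompt assembly below are illustrative, not the toolkit's actual templates:

```python
from enum import Enum

class Scenario(Enum):
    CODE_GENERATION = "write a solution from the problem statement"
    SELF_REPAIR = "fix a previously generated, failing solution given execution feedback"
    CODE_EXECUTION = "predict the result of running a given program on a given input"
    TEST_OUTPUT_PREDICTION = "predict the expected output for a given test input"

def build_prompt(scenario: Scenario, problem_statement: str, extra: str = "") -> str:
    """Very rough prompt assembly for a chosen scenario."""
    return f"Task: {scenario.value}\n\nProblem:\n{problem_statement}\n\n{extra}"
```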

Execution-Based Accuracy Metrics

Uses hidden test cases to measure functional correctness of generated code through actual code execution.
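
A simplified illustration of the idea, assuming Python solutions that read stdin and write stdout, and hidden tests given as input/expected-output pairs; it also includes the standard unbiased pass@k estimator. A production harness would additionally need sandboxing and resource limits:

```python
import subprocess
import sys
from math import comb

def passes_hidden_tests(source: str, tests: list[dict], timeout: float = 10.0) -> bool:
    """Run a candidate solution against every hidden test (stdin -> expected stdout)."""
    for t in tests:
        try:
            run = subprocess.run(
                [sys.executable, "-c", source],
                input=t["input"], capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if run.returncode != 0 or run.stdout.strip() != t["expected_output"].strip():
            return False
    return True

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n sampled solutions, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```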

Open-Source Toolkit with Leaderboard

Provides a reproducible evaluation framework and a leaderboard to compare LLM performance across difficulty levels.
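
A difficulty-split comparison can be as simple as averaging pass rates per model and difficulty level; a hypothetical sketch of that tabulation:

```python
from collections import defaultdict

def leaderboard_by_difficulty(results):
    """results: iterable of (model, difficulty, passed) tuples.
    Returns the mean pass rate for each (model, difficulty) cell."""
    totals, passed = defaultdict(int), defaultdict(int)
    for model, difficulty, ok in results:
        totals[(model, difficulty)] += 1
        passed[(model, difficulty)] += bool(ok)
    return {key: passed[key] / totals[key] for key in totals}

# leaderboard_by_difficulty([("model-a", "easy", True), ("model-a", "easy", False)])
# -> {("model-a", "easy"): 0.5}
```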