Continuous Problem Collection
Automatically gathers new coding problems from live contests on LeetCode, AtCoder, and CodeForces to maintain an up-to-date benchmark.
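A collected problem can be thought of as one record carrying its platform, contest, difficulty, release date, and test cases. The sketch below is a minimal illustration of such a record; the field names and example values are assumptions for exposition, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Problem:
    """One collected problem; fields are illustrative, not the benchmark's real schema."""
    platform: str        # e.g. "leetcode", "atcoder", "codeforces"
    contest_id: str      # contest the problem appeared in
    problem_id: str      # platform-specific identifier
    title: str
    statement: str       # natural-language problem description
    difficulty: str      # e.g. "easy" / "medium" / "hard"
    release_date: date   # contest date, used for time annotation
    public_tests: list[tuple[str, str]] = field(default_factory=list)  # (stdin, expected stdout)
    hidden_tests: list[tuple[str, str]] = field(default_factory=list)  # held out for grading

# Placeholder record; real entries come from continuous contest scraping.
example = Problem(
    platform="atcoder",
    contest_id="example-contest",
    problem_id="example-problem-a",
    title="Example Problem",
    statement="Read an integer N and print N * 2.",
    difficulty="easy",
    release_date=date(2024, 6, 1),
)
```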
Time-Annotated Problem Sets
Annotates each problem with its release date, so models can be evaluated only on problems published after their training cutoff, supporting contamination-free benchmarking.
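Given release-date annotations, selecting a contamination-free evaluation window reduces to a date filter. The sketch below assumes a `release_date` field like the one in the record above; the cutoff dates in the commented usage are hypothetical.

```python
from datetime import date

def select_window(problems, start: date, end: date):
    """Keep problems released in [start, end), e.g. after a model's training-data cutoff.
    Assumes each problem exposes a `release_date` attribute."""
    return [p for p in problems if start <= p.release_date < end]

# Hypothetical usage: a model trained on data up to Aug 2023 is evaluated
# only on problems released after that cutoff.
# fresh = select_window(all_problems, date(2023, 9, 1), date(2024, 6, 1))
```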
Multiple Evaluation Scenarios
Supports code generation, self-repair, code execution, and test output prediction to comprehensively assess LLM coding capabilities.
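The four scenarios differ mainly in what the model is asked to produce. A minimal dispatch over scenario types might look like the sketch below; the scenario names, prompt wording, and `context` keys are assumptions, not the toolkit's actual prompts.

```python
from enum import Enum

class Scenario(Enum):
    CODE_GENERATION = "code_generation"        # write a solution from the statement
    SELF_REPAIR = "self_repair"                # fix a previously failing solution
    CODE_EXECUTION = "code_execution"          # predict a program's output on an input
    TEST_OUTPUT_PREDICTION = "test_output"     # predict the expected output of a test case

def build_prompt(scenario: Scenario, problem, context: dict) -> str:
    """Illustrative prompt construction; real prompts are more detailed."""
    if scenario is Scenario.CODE_GENERATION:
        return f"Solve the following problem in Python:\n{problem.statement}"
    if scenario is Scenario.SELF_REPAIR:
        return (f"The following solution fails:\n{context['failed_code']}\n"
                f"Error:\n{context['error']}\nFix it.")
    if scenario is Scenario.CODE_EXECUTION:
        return (f"Given this program:\n{context['program']}\n"
                f"What does it print for input {context['input']!r}?")
    return (f"Problem:\n{problem.statement}\n"
            f"What is the expected output for input {context['input']!r}?")
```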
Execution-Based Accuracy Metrics
Measures the functional correctness of generated code by actually executing it against hidden test cases.
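Functional correctness can be checked by running each generated program against the hidden test cases in a subprocess and comparing its output to the expected output. The sketch below is a simplified, unsandboxed illustration under that assumption, not the benchmark's actual harness.

```python
import subprocess
import sys

def passes_hidden_tests(code: str, hidden_tests, timeout: float = 5.0) -> bool:
    """Run `code` once per (stdin, expected_stdout) pair and require exact output matches.
    A real harness would add sandboxing, resource limits, and output normalization."""
    for stdin_data, expected in hidden_tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

def accuracy(generations: dict, tests: dict) -> float:
    """generations: {problem_id: code}, tests: {problem_id: [(stdin, expected), ...]}.
    Returns the fraction of problems whose generation passes every hidden test."""
    if not generations:
        return 0.0
    passed = sum(passes_hidden_tests(code, tests[pid]) for pid, code in generations.items())
    return passed / len(generations)
```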
Open-Source Toolkit with Leaderboard
Provides a reproducible evaluation framework and a leaderboard to compare LLM performance across difficulty levels.
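Leaderboard entries can be produced by grouping per-problem pass/fail results by difficulty and averaging per model. The grouping and example values below are illustrative only.

```python
from collections import defaultdict

def leaderboard(results):
    """results: iterable of (model, difficulty, passed: bool).
    Returns {model: {difficulty: pass_rate}} suitable for a leaderboard table."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # model -> difficulty -> [passed, total]
    for model, difficulty, passed in results:
        cell = counts[model][difficulty]
        cell[0] += int(passed)
        cell[1] += 1
    return {
        model: {diff: passed / total for diff, (passed, total) in cells.items()}
        for model, cells in counts.items()
    }

# Hypothetical example: two models scored on a handful of problems.
rows = [
    ("model-a", "easy", True), ("model-a", "hard", False),
    ("model-b", "easy", True), ("model-b", "hard", True),
]
print(leaderboard(rows))
# {'model-a': {'easy': 1.0, 'hard': 0.0}, 'model-b': {'easy': 1.0, 'hard': 1.0}}
```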