LiveCodeBench
LiveCodeBench is an open-source benchmark designed to evaluate large language models (LLMs) on coding tasks derived from competitive programming contests. It continuously collects problems from platforms such as LeetCode, AtCoder, and CodeForces, ensuring that the problems used for evaluation are released after the model's training cutoff date to prevent data contamination. The benchmark includes over 1,000 problems spanning easy to hard difficulty levels as of its latest release (v6). LiveCodeBench assesses multiple aspects of coding capabilities including code generation, self-repair, code execution, and test output prediction, using execution-based accuracy metrics with hidden test cases for functional correctness.
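The execution-based evaluation described above can be illustrated with a minimal sketch: run a candidate solution against hidden input/output pairs in a subprocess and count it as correct only if every case passes. The function name and test-case format here are illustrative assumptions, not LiveCodeBench's actual API.

```python
import subprocess
import sys

def passes_hidden_tests(source_code: str, test_cases: list[tuple[str, str]]) -> bool:
    """Run source_code as a stdin/stdout program against hidden test cases.

    Each test case is an (input, expected_output) pair; the candidate is
    judged correct only if it produces the expected output on every case.
    (Illustrative helper, not LiveCodeBench's API.)
    """
    for stdin_data, expected in test_cases:
        result = subprocess.run(
            [sys.executable, "-c", source_code],
            input=stdin_data,
            capture_output=True,
            text=True,
            timeout=10,  # guard against infinite loops
        )
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

# A toy "contest problem": read two integers, print their sum.
candidate = "a, b = map(int, input().split())\nprint(a + b)"
hidden = [("1 2", "3"), ("10 -4", "6")]
print(passes_hidden_tests(candidate, hidden))
```

A real harness additionally sandboxes execution and enforces memory limits, but the pass/fail criterion is the same: functional correctness on hidden tests, not string similarity to a reference solution.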
LiveCodeBench provides contamination-free, time-annotated evaluation of LLMs on competitive programming problems across multiple coding scenarios.
Benchmarking LLM Coding Performance
Researchers and developers can evaluate the coding abilities of large language models on recent competitive programming problems that the models have not seen during training.
Testing Code Generation and Repair
Use LiveCodeBench to assess not only code generation but also the model's ability to self-repair code and predict test outputs.
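For the test-output-prediction scenario, scoring reduces to exact matching of the model's predicted program outputs against the outputs obtained by actually executing the code. A minimal scoring sketch, with a function name assumed for illustration rather than taken from LiveCodeBench:

```python
def output_prediction_accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of predicted outputs that exactly match executed outputs.

    Whitespace is normalized before comparison, since trailing newlines
    are usually not semantically meaningful for contest-style output.
    (Illustrative metric helper, not LiveCodeBench's API.)
    """
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Two predictions, one of which matches the actual execution output.
score = output_prediction_accuracy(["3\n", "8"], ["3", "7"])
print(score)
```

Self-repair is evaluated analogously: the model's revised program is re-run against the same hidden tests, so the metric stays execution-based rather than relying on textual comparison.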
To get started, clone the repository with git clone https://github.com/LiveCodeBench/LiveCodeBench.git, navigate into the directory with cd LiveCodeBench, and use the uv command to set up dependencies and verify the installation. Evaluation uses the latest problem release, release_v6, which contains 1,055 problems.