LLM Benchmarks & Leaderboards

Each benchmark below ranks large language models on a specific task and pairs every score with the model's cost per million tokens, so you can see which model offers the best value, not just the highest score.
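As a rough illustration of comparing value rather than raw score, the sketch below ranks entries by benchmark points per dollar. The `Entry` type, the `score_per_dollar` ratio, and the model names and numbers are all hypothetical placeholders chosen for the example; they are not taken from any leaderboard on this page, and the page itself does not state how "best value" is defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical shape, for illustration only)."""
    model: str
    score: float          # benchmark score, higher is better
    usd_per_mtok: float   # cost per million tokens, in USD


def score_per_dollar(entry: Entry) -> float:
    # Assumed value metric: benchmark points per USD per million tokens.
    return entry.score / entry.usd_per_mtok


def rank_by_value(entries: list[Entry]) -> list[Entry]:
    # Highest score-per-dollar first.
    return sorted(entries, key=score_per_dollar, reverse=True)


# Placeholder data, not real benchmark results.
entries = [
    Entry("model-a", score=70.0, usd_per_mtok=10.0),
    Entry("model-b", score=60.0, usd_per_mtok=2.0),
    Entry("model-c", score=40.0, usd_per_mtok=0.5),
]

best_score = max(entries, key=lambda e: e.score)
best_value = rank_by_value(entries)[0]
print(f"highest score: {best_score.model}, best value: {best_value.model}")
```

A plain ratio is only one possible choice. It suits percentage-style scores better than Elo ratings, where the differences between ratings matter more than their absolute size.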

LMArena (Chatbot Arena) Elo

192 models

human-preference

LiveBench

71 models

reasoning/coding

SWE-bench Verified

49 models

coding

Aider Polyglot

40 models

coding

Looking for the underlying datasets? Browse our benchmark datasets (MMLU, GPQA, HumanEval and more).