LLM Benchmarks & Leaderboards

Each benchmark below ranks large language models on a specific task and pairs every score with the model's cost per million tokens, so you can see which model offers the best value, not just which one scores highest.
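As a rough illustration of how a score and a price can be combined into a single "value" figure, here is a minimal sketch. It assumes value means benchmark score divided by cost per million tokens. That metric, the `Entry` type, the model names, and all numbers are hypothetical and are not taken from the leaderboards themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str            # hypothetical model name
    score: float          # benchmark score, higher is better
    cost_per_mtok: float  # USD per million tokens


def score_per_dollar(entry: Entry) -> float:
    """Assumed value metric: score points per dollar per million tokens."""
    if entry.cost_per_mtok <= 0:
        raise ValueError(f"cost must be positive for {entry.model}")
    return entry.score / entry.cost_per_mtok


def rank_by_value(entries: list[Entry]) -> list[Entry]:
    """Sort entries from best to worst value under the assumed metric."""
    return sorted(entries, key=score_per_dollar, reverse=True)


# Made-up entries, for illustration only.
entries = [
    Entry("model-a", score=90.0, cost_per_mtok=10.0),
    Entry("model-b", score=80.0, cost_per_mtok=2.0),
    Entry("model-c", score=60.0, cost_per_mtok=1.0),
]

best_score = max(entries, key=lambda e: e.score)
best_value = rank_by_value(entries)[0]

print("highest score:", best_score.model)
print("best value:", best_value.model)
```

In this made-up data the highest-scoring model and the best-value model differ, which is the distinction the cost column is meant to surface. A plain ratio is only one possible choice; it favors very cheap models, so a minimum score threshold may be worth applying first.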

Looking for the underlying datasets? Browse our benchmark datasets (MMLU, GPQA, HumanEval and more).