Language Modeling Datasets
There are 3 language modeling datasets in our directory. Each entry links to its source, paper, and download. Browse the full list below or filter by language.
Language modeling is the task of predicting the next token in a sequence given the tokens before it. It is the core pre-training objective behind most large language models (LLMs).
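To make next-token prediction concrete, here is a minimal sketch of a bigram model that estimates the probability of each next token from counts in a toy corpus. The corpus and the function names (train_bigram, next_token_probs) are illustrative assumptions, not taken from any dataset in this directory, and real LLMs use neural networks conditioned on much longer contexts rather than counts.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Estimate P(next | prev) by relative frequency; empty dict if prev is unseen."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}
    return {tok: c / total for tok, c in counts[prev].items()}


# Hypothetical toy corpus, whitespace-tokenized for illustration only.
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")
print(probs)
```

In this toy corpus, "the" is followed by "cat" twice and "mat" once, so the model assigns "cat" the higher probability. A language modeling dataset plays the role of the corpus here, at a far larger scale.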
Updated June 2026