mlfoundations/dclm-baseline-1.0

General NLP, English

mlfoundations/dclm-baseline-1.0 is an English-language general NLP dataset distributed in Parquet format.

About mlfoundations/dclm-baseline-1.0

DCLM-baseline

DCLM-baseline is a 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks. Below are comparisons of models trained on DCLM-baseline with other models in the 7B regime. ...
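As a rough sanity check on the scale quoted in the description, the average document length can be estimated from the two headline figures. The sketch below assumes "4T" means 4 × 10^12 tokens and "3B" means 3 × 10^9 documents, both rounded; the constant and function names are illustrative and not part of the dataset or any library.

```python
# Rounded headline figures from the dataset description (treated as exact here,
# which is an assumption; the real counts are not given on this page).
TOTAL_TOKENS = 4 * 10**12      # "4T token"
TOTAL_DOCUMENTS = 3 * 10**9    # "3B document"


def average_tokens_per_document(tokens: int, documents: int) -> float:
    """Return the mean number of tokens per document."""
    if documents <= 0:
        raise ValueError("documents must be positive")
    return tokens / documents


avg_tokens = average_tokens_per_document(TOTAL_TOKENS, TOTAL_DOCUMENTS)
print(f"~{avg_tokens:.0f} tokens per document on average")
```

Because both inputs are rounded, the result is only an order-of-magnitude estimate: on the order of a thousand tokens per document.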

Details

Task: General NLP
Language: English
Format: Parquet
Rows / instances: N/A
Creator: mlfoundations
Year: 2026
