mlfoundations/dclm-baseline-1.0
General NLP, English
mlfoundations/dclm-baseline-1.0 is an English-language dataset for general NLP, distributed in Parquet format.
About mlfoundations/dclm-baseline-1.0
DCLM-baseline
DCLM-baseline is a pretraining dataset of 4T tokens and 3B documents; models trained on it achieve strong performance on language model benchmarks.
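As a rough sense of scale, the two headline figures above imply an average document length. The sketch below only divides one by the other; the constant and function names are illustrative, and because both inputs are rounded, the result is an order-of-magnitude estimate rather than a published statistic.

```python
# Headline figures from the dataset description. Both are rounded,
# so the derived average is only a rough estimate.
TOTAL_TOKENS = 4e12      # about 4 trillion tokens
TOTAL_DOCUMENTS = 3e9    # about 3 billion documents


def average_tokens_per_document(total_tokens: float, total_documents: float) -> float:
    """Return the mean number of tokens per document."""
    if total_documents <= 0:
        raise ValueError("total_documents must be positive")
    return total_tokens / total_documents


avg_tokens = average_tokens_per_document(TOTAL_TOKENS, TOTAL_DOCUMENTS)
print(f"~{avg_tokens:,.0f} tokens per document on average")
```

This works out to roughly 1,300 tokens per document, which says nothing about how document lengths are actually distributed.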
Below are comparisons of a model trained on DCLM-baseline with other models in the 7B-parameter regime.
...
Details
- Task: General NLP
- Language: English
- Format: Parquet
- Rows / instances: N/A
- Creator: mlfoundations
- Year: 2026