mlfoundations/dclm-baseline-1.0-parquet

General NLP · EN

Created by mlfoundations in 2026, mlfoundations/dclm-baseline-1.0-parquet is an English-language General NLP dataset in Parquet format.

About mlfoundations/dclm-baseline-1.0-parquet

DCLM-baseline

Note: this is an identical copy of https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0, with all files converted to Parquet format. DCLM-baseline is a 4T token / 3B document pretraining dataset th...
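Because the dataset is hosted on Hugging Face in Parquet format, a few records can be inspected without downloading the whole corpus. The following is a minimal sketch, not an official loading recipe: it assumes the Hugging Face `datasets` library is installed and that the repository exposes a `train` split (an assumption not stated on this page). The helper name `peek_records` and its `loader` parameter are hypothetical, introduced only for illustration.

```python
from itertools import islice


def peek_records(repo_id, n=3, split="train", loader=None):
    """Return the first n records of a dataset, read in streaming mode.

    `loader` is a hypothetical hook for substituting the loading function;
    by default the Hugging Face `datasets.load_dataset` function is used.
    The "train" split name is an assumption about this repository.
    """
    if loader is None:
        # Imported lazily so the helper can be defined without the library.
        from datasets import load_dataset
        loader = load_dataset
    # streaming=True iterates over records instead of downloading every file.
    stream = loader(repo_id, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek_records("mlfoundations/dclm-baseline-1.0-parquet")` would, under the assumptions above, fetch only the first few records over the network. Streaming is the natural choice here given the dataset's stated size.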

Details

Task: General NLP
Language: EN
Format: Parquet
Rows / instances: N/A
Creator: mlfoundations
Year: 2026