mlfoundations/dclm-baseline-1.0-parquet
General NLP, EN
Created by mlfoundations in 2026, mlfoundations/dclm-baseline-1.0-parquet is a General NLP dataset in English (EN), distributed in Parquet format.
About mlfoundations/dclm-baseline-1.0-parquet
DCLM-baseline
Note: this is a copy of https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0 with identical contents, where all the files have been converted to Parquet format.
DCLM-baseline is a 4T token / 3B document pretraining dataset th...
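Because the files are stored as Parquet on the Hugging Face Hub, one plausible way to inspect a few records without downloading the whole multi-terabyte corpus is streaming mode in the Hugging Face `datasets` library. The sketch below is illustrative only: the helper name `stream_first_rows` is hypothetical, and the split name "train" is an assumption that this listing does not confirm.

```python
from itertools import islice

# Repository ID taken from this listing.
DATASET_ID = "mlfoundations/dclm-baseline-1.0-parquet"


def stream_first_rows(n=3, dataset_id=DATASET_ID, split="train"):
    """Return the first n records as dicts, fetched lazily.

    Hypothetical helper. The split name "train" is an assumption;
    check the dataset page for the splits that actually exist.
    """
    # Imported here so the module can be loaded without the third-party
    # `datasets` package (pip install datasets) being present.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of downloading
    # every Parquet file up front.
    stream = load_dataset(dataset_id, split=split, streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    for row in stream_first_rows(2):
        # Column names are not listed here, so just show what is present.
        print(sorted(row.keys()))
```

Running the script needs network access to the Hub. Printing the keys first is a cheap way to discover the schema before writing any processing code.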
Details
- Task: General NLP
- Language: EN
- Format: Parquet
- Rows / instances: N/A
- Creator: mlfoundations
- Year: 2026