
gair-prox/DCLM-pro

Text Generation | EN | odc-by

gair-prox/DCLM-pro is an English (EN) text generation dataset distributed in Parquet format. It is released under the odc-by license, falls in the 100M<n<1B size category, and has been downloaded 16.8K times.

About gair-prox/DCLM-pro

📚 DCLM-pro

ArXiv | Models | Code

DCLM-pro is refined from DCLM using the ProX refining framework. It contains more than 500B high-quality tokens, ready for general language model pre-training.

License

DCLM-pro is based on DCLM...
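Because the corpus is large and stored as Parquet, a common way to inspect it is to stream a few records rather than download everything. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository exposes a `train` split; the helper name `peek_records` and the split name are illustrative assumptions, not taken from the dataset card.

```python
from itertools import islice


def peek_records(repo_id, n=3, split="train"):
    """Return the first n records of a Hub dataset without downloading all of it."""
    # Imported lazily so the helper can be defined without the library installed.
    from datasets import load_dataset

    # streaming=True yields records lazily instead of materializing the corpus.
    # The split name "train" is an assumption, not confirmed by the dataset card.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek_records("gair-prox/DCLM-pro")` would then return a short list of dictionaries whose keys show the columns actually present in the Parquet files, which is a safer way to discover the schema than guessing field names.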

Details

Task: Text Generation
Language: EN
Format: Parquet
Rows / instances: N/A
Size: 100M<n<1B
Creator: gair-prox
Year: 2025
License: odc-by
Downloads: 16764
Likes: 13
