gair-prox/DCLM-pro
Text Generation | EN | odc-by
gair-prox/DCLM-pro is an English text-generation dataset distributed in Parquet format under the odc-by license. It falls in the 100M<n<1B size category and has been downloaded 16.8K times.
About gair-prox/DCLM-pro
📚 DCLM-pro
ArXiv | Models | Code
DCLM-pro is refined from DCLM using the ProX refining framework.
It contains more than 500B high-quality tokens, ready for general language model pre-training.
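Because the corpus is large, it is usually more practical to stream a few records than to download everything up front. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the data is exposed under a split named `train` (the split name is an assumption, not something stated on this page). The helper names `take` and `peek_dclm_pro` are hypothetical and exist only for this illustration.

```python
from itertools import islice

DATASET_ID = "gair-prox/DCLM-pro"  # repository id as given on this page


def take(iterable, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(iterable, n))


def peek_dclm_pro(n=3, split="train"):
    """Stream the first n records without downloading the full dataset.

    The default split name "train" is an assumption; adjust it if the
    repository uses a different layout.
    """
    # Imported lazily so that `take` stays usable without the library.
    from datasets import load_dataset

    # streaming=True yields records lazily instead of fetching every shard.
    stream = load_dataset(DATASET_ID, split=split, streaming=True)
    return take(stream, n)
```

Calling `peek_dclm_pro(3)` requires network access. Since the column names are not listed here, inspecting `record.keys()` on the returned records is safer than assuming a particular field name.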
License
DCLM-pro is based on DCLM...
Details
- Task: Text Generation
- Language: EN
- Format: Parquet
- Rows / instances: N/A
- Size: 100M<n<1B
- Creator: gair-prox
- Year: 2025
- License: odc-by
- Downloads: 16764
- Likes: 13