DATA-MASK/FineWeb-Mask
Text Generation | EN | apache-2.0
Created by DATA-MASK in 2025, DATA-MASK/FineWeb-Mask is an English text generation dataset in Parquet format. It has about 34K downloads and 6 likes, is released under the apache-2.0 license, and falls in the n>1T size category.
About DATA-MASK/FineWeb-Mask
FineWeb-Mask
DATAMASK Paper | GitHub Repository | FineWeb-Mask Dataset
Introduction
FineWeb-Mask is a 1.5 trillion token, high-efficiency pre-training dataset curated using the DATAMASK framework. Developed by the ByteDance...
Details
- Task: Text Generation
- Language: EN
- Format: Parquet
- Rows / instances: N/A
- Size: n>1T
- Creator: DATA-MASK
- Year: 2025
- License: apache-2.0
- Downloads: 33973
- Likes: 6
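
Given the n>1T size and Parquet format listed above, reading a few records by streaming is usually more practical than downloading everything first. The following is a minimal sketch, not an official loading recipe. It assumes the Hugging Face `datasets` library is installed and that the repository exposes a split named `train`. The split name and the helper names `open_stream` and `take_rows` are hypothetical and do not come from the dataset card.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

REPO_ID = "DATA-MASK/FineWeb-Mask"  # repository name as given on this page


def open_stream(split: str = "train") -> Iterable[Dict[str, Any]]:
    """Open the dataset lazily. The split name "train" is an assumption."""
    # Imported here so take_rows() stays usable without the library installed.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of downloading all files.
    return load_dataset(REPO_ID, split=split, streaming=True)


def take_rows(stream: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return at most the first n records from an iterable of records."""
    return list(islice(stream, n))


# Example usage (requires network access):
#   for row in take_rows(open_stream(), 3):
#       print(sorted(row.keys()))  # inspect column names; no schema is assumed
```

Printing the keys of the first few records is a cautious way to discover the actual column names before writing any processing code against them.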