DATA-MASK/FineWeb-Mask

Text Generation | EN | apache-2.0

DATA-MASK/FineWeb-Mask is an English text generation dataset in Parquet format, created by DATA-MASK in 2025. It has about 34K downloads and 6 likes, is released under the apache-2.0 license, and falls in the n>1T size category.

About DATA-MASK/FineWeb-Mask

FineWeb-Mask

📜 DATAMASK Paper | 💻 GitHub Repository | 📦 Fineweb-Mask Dataset

📚 Introduction

FineWeb-Mask is a 1.5 trillion token, high-efficiency pre-training dataset curated using the DATAMASK framework. Developed by the ByteDance...

Details

Task: Text Generation
Language: EN
Format: Parquet
Rows / instances: N/A
Size: n>1T
Creator: DATA-MASK
Year: 2025
License: apache-2.0
Downloads: 33973
Likes: 6
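
Given the Parquet format and the n>1T size category listed above, reading rows lazily is likely more practical than downloading everything first. The following is a minimal sketch, not an official loading recipe. It assumes the dataset is hosted on the Hugging Face Hub under the identifier shown on this page and that it exposes a "train" split. Neither is confirmed here, and the column names are not listed either. `stream_rows` and `take` are hypothetical helper names introduced for illustration.

```python
from itertools import islice
from typing import Any, Iterable, Iterator

# Identifier as shown on this page; Hub hosting is an assumption.
REPO_ID = "DATA-MASK/FineWeb-Mask"


def stream_rows(repo_id: str = REPO_ID, split: str = "train") -> Iterator[dict]:
    """Lazily iterate over rows instead of downloading the full dataset.

    Assumes a "train" split exists; adjust if the dataset uses another name.
    """
    # Imported inside the function so `take` stays usable without the library.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches data on demand.
    return iter(load_dataset(repo_id, split=split, streaming=True))


def take(rows: Iterable[Any], n: int) -> list:
    """Return at most the first n items of any iterable."""
    return list(islice(rows, n))
```

Under those assumptions, `take(stream_rows(), 3)` would fetch three rows over the network, and inspecting `row.keys()` on one of them would show the actual column names.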