
ByteDance-Seed/mga-fineweb-edu

Text Generation | EN | odc-by

ByteDance-Seed/mga-fineweb-edu is an English-language (EN) text generation dataset published by ByteDance-Seed in 2025. It has 616 downloads and 36 likes, is released under the odc-by license, and falls in the 100M<n<1B size category.

About ByteDance-Seed/mga-fineweb-edu

Massive Genre-Audience Augment Fineweb-Edu Corpus

This dataset is a synthetic pretraining corpus described in the paper "Reformulation for Pretraining Data Augmentation".

[Figure: Overview of synthesis framework.]

Our method expands the original corpus throug...

Details

Task: Text Generation
Language: EN
Format: Parquet
Rows / instances: N/A
Size: 100M<n<1B
Creator: ByteDance-Seed
Year: 2025
License: odc-by
Downloads: 616
Likes: 36
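
Because the corpus is distributed as Parquet and sits in the 100M<n<1B size category, streaming a few rows is usually more practical than downloading everything. The following is a minimal sketch, not an official loading recipe. It assumes the dataset is hosted on the Hugging Face Hub under the id shown on this page, that the third-party `datasets` package is installed, and that a split named "train" exists. The helper name `peek_rows` is hypothetical.

```python
from itertools import islice

# Dataset id exactly as listed on this page.
REPO_ID = "ByteDance-Seed/mga-fineweb-edu"


def peek_rows(n=3, split="train"):
    """Stream the first n rows without downloading the whole corpus.

    Assumptions (not confirmed by this page): the `datasets` package is
    installed, network access is available, and the repo exposes a split
    called "train".
    """
    # Imported lazily so that defining this helper does not require the package.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches rows on demand.
    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return list(islice(stream, n))
```

This page does not list the column names. Calling `peek_rows(1)` and inspecting the keys of the returned row is one way to discover them before writing any processing code.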