ByteDance-Seed/mga-fineweb-edu
Text Generation | EN | odc-by
The ByteDance-Seed/mga-fineweb-edu dataset is an English text generation resource released by ByteDance-Seed in 2025. It has 616 downloads and 36 likes, is released under the odc-by license, and falls in the 100M<n<1B size category.
About ByteDance-Seed/mga-fineweb-edu
Massive Genre-Audience Augment Fineweb-Edu Corpus
This dataset is a synthetic pretraining corpus described in the paper "Reformulation for Pretraining Data Augmentation".
Overview of synthesis framework. Our method expands the original corpus throug...
Details
- Task: Text Generation
- Language: EN
- Format: Parquet
- Rows / instances: N/A
- Size: 100M<n<1B
- Creator: ByteDance-Seed
- Year: 2025
- License: odc-by
- Downloads: 616
- Likes: 36
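Given the size category listed above, it may be more practical to stream a few rows than to download the whole corpus. The following is a minimal sketch, assuming the dataset is hosted on the Hugging Face Hub under the repository name shown on this page and that the `datasets` library is installed. The split name `"train"` and the helper names `open_stream`, `first_rows`, and `main` are assumptions for illustration; this page does not document splits or column names.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

REPO_ID = "ByteDance-Seed/mga-fineweb-edu"  # repository name as given on this page


def open_stream(split: str = "train"):
    """Open the dataset lazily instead of downloading it in full.

    Assumption: a split named "train" exists; this page does not list splits.
    """
    # Imported here so the module loads even without the library installed.
    from datasets import load_dataset  # requires: pip install datasets

    return load_dataset(REPO_ID, split=split, streaming=True)


def first_rows(rows: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return at most n rows from any iterable of row dictionaries."""
    return list(islice(rows, n))


def main() -> None:
    """Print the column names of the first few streamed rows."""
    for row in first_rows(open_stream()):
        # Column names are not documented on this page, so list them.
        print(sorted(row.keys()))
```

Calling `main()` needs network access. `first_rows` works on any iterable, so it can be tried on local data first.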