HuggingFaceFW/fineweb_edu_100BT-shuffled
General NLP | EN | odc-by
HuggingFaceFW/fineweb_edu_100BT-shuffled is a General NLP dataset in English (EN), created by HuggingFaceFW in 2026. It contains 102,063,987 records in Parquet format, which places it in the 100M<n<1B size category. It is released under the odc-by license and has 47.4K downloads and 0 likes.
About HuggingFaceFW/fineweb_edu_100BT-shuffled
FineWeb-Edu 100BT (Shuffled)
A globally shuffled version of HuggingFaceFW/fineweb_edu_100BT.
Part of the Smol-Data collection — tried and tested mixes for strong pretraining.
Dataset Description
This dataset contains the same ~100B t...
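To make "globally shuffled" concrete, here is a minimal sketch contrasting a shard-local shuffle with a global one on toy data. It is only an illustration: the function names and the toy shards are hypothetical, and this page does not state how the actual shuffle was performed.

```python
import random


def shard_local_shuffle(shards, seed=0):
    """Shuffle rows inside each shard; rows never move between shards."""
    rng = random.Random(seed)
    shuffled = []
    for shard in shards:
        rows = list(shard)
        rng.shuffle(rows)
        shuffled.append(rows)
    return shuffled


def global_shuffle(shards, seed=0):
    """Pool all rows, shuffle once, then re-split into shards of the original sizes."""
    rng = random.Random(seed)
    sizes = [len(shard) for shard in shards]
    pool = [row for shard in shards for row in shard]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for size in sizes:
        shuffled.append(pool[start:start + size])
        start += size
    return shuffled


# Toy data (hypothetical): three shards, each holding rows from a single source.
toy_shards = [[f"{name}{i}" for i in range(4)] for name in ("a", "b", "c")]

local_result = shard_local_shuffle(toy_shards, seed=0)
global_result = global_shuffle(toy_shards, seed=0)
```

After a shard-local shuffle every shard still contains rows from one source only, whereas a global shuffle lets any row land in any shard. Reading the shards of a globally shuffled dataset in order therefore yields a mix of sources.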
Details
- Task: General NLP
- Language: EN
- Format: Parquet
- Rows / instances: 102,063,987
- Size: 100M<n<1B
- Creator: HuggingFaceFW
- Year: 2026
- License: odc-by
- Downloads: 47,407
- Likes: 0
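With over 100 million rows, it may be convenient to inspect a few records before downloading anything in full. The sketch below assumes the third-party Hugging Face `datasets` library is installed and that the repository exposes a `train` split. Neither is stated on this page, and `peek_rows` is a hypothetical helper name.

```python
from itertools import islice

# Repository ID as listed on this page.
REPO_ID = "HuggingFaceFW/fineweb_edu_100BT-shuffled"


def peek_rows(n=3, split="train"):
    """Return the first n rows as dicts using streaming mode.

    Assumptions: `datasets` is installed (pip install datasets) and the
    split is named "train"; adjust `split` if the repository differs.
    """
    # Imported lazily so this module loads even without the library.
    from datasets import load_dataset

    # streaming=True iterates over the remote files without a full download.
    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek_rows(3)` needs network access. The column names in the returned rows are not listed on this page, so inspect the keys of the first row rather than assuming a schema.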