HuggingFaceFW/fineweb_edu_100BT-shuffled

General NLP | EN | odc-by

Created by HuggingFaceFW in 2026, HuggingFaceFW/fineweb_edu_100BT-shuffled is an English-language General NLP dataset containing 102,063,987 records in Parquet format. It has 47,407 downloads and 0 likes. It is released under the odc-by license and falls in the 100M<n<1B size category.

About HuggingFaceFW/fineweb_edu_100BT-shuffled

FineWeb-Edu 100BT (Shuffled)

A globally shuffled version of HuggingFaceFW/fineweb_edu_100BT. Part of the Smol-Data collection: tried and tested mixes for strong pretraining.

Dataset Description

This dataset contains the same ~100B t...

Details

Task: General NLP
Language: EN
Format: Parquet
Rows / instances: 102,063,987
Size: 100M<n<1B
Creator: HuggingFaceFW
Year: 2026
License: odc-by
Downloads: 47,407
Likes: 0
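
Because the dataset holds more than 100 million Parquet records, it is usually more practical to stream a few rows than to download everything. The sketch below shows one way to do that with the Hugging Face `datasets` library, assuming it is installed. The helper name `stream_head` is hypothetical, and the `train` split name is an assumption that this listing does not confirm.

```python
from itertools import islice

# Repository ID as given in this listing.
REPO_ID = "HuggingFaceFW/fineweb_edu_100BT-shuffled"


def stream_head(n=3, split="train"):
    """Return the first n records without downloading the full dataset.

    Assumptions: the `datasets` library is installed, and the repository
    exposes a split called "train" (not confirmed by this listing).
    """
    # Imported lazily so the module can be loaded without `datasets` present.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of fetching every file.
    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `stream_head(2)` requires network access and returns a list of two dictionaries, one per record. Inspect the keys of the first record to see which columns are available, since this listing does not name them.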