Salesforce/fineweb_deduplicated
General NLP, English
Created by Salesforce in 2024, Salesforce/fineweb_deduplicated is a General NLP dataset in English in Parquet format.
About Salesforce/fineweb_deduplicated
TL;DR
Fineweb is a popular, high-quality open dataset. This dataset is a deduplicated version of Fineweb: rows with duplicate text are removed, and duplicate counts are collected.
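To make the idea concrete, here is a minimal sketch of removing rows with duplicate text while collecting counts. It is not the pipeline used to build this dataset, which the summary does not describe. The function name `deduplicate_with_counts`, the column names `text` and `count`, and the use of exact string matching are all assumptions made for illustration.

```python
from collections import Counter


def deduplicate_with_counts(rows, text_key="text"):
    """Keep the first row for each distinct text and attach its occurrence count.

    Assumptions (not taken from the dataset card): rows are dicts, the text
    lives under `text_key`, and duplicates are exact string matches.
    """
    # First pass: count how many times each text appears.
    counts = Counter(row[text_key] for row in rows)

    # Second pass: keep only the first occurrence of each text.
    seen = set()
    unique_rows = []
    for row in rows:
        text = row[text_key]
        if text in seen:
            continue
        seen.add(text)
        # "count" is a hypothetical column name for the duplicate count.
        unique_rows.append({**row, "count": counts[text]})
    return unique_rows


sample = [
    {"text": "hello world"},
    {"text": "foo bar"},
    {"text": "hello world"},
]
deduped = deduplicate_with_counts(sample)
```

On the three-row sample, the two identical rows collapse into one row carrying a count of 2, and the remaining row keeps a count of 1. At web scale an in-memory set would not be practical, so a real pipeline would more likely hash the text and aggregate in a distributed engine.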
Motivation
Fineweb is an open text dataset intended for training lan...
Details
- Task: General NLP
- Language: English
- Format: Parquet
- Rows / instances: N/A
- Creator: Salesforce
- Year: 2024