Salesforce/fineweb_deduplicated

General NLP, English

Created by Salesforce in 2024, Salesforce/fineweb_deduplicated is an English General NLP dataset distributed in Parquet format.

About Salesforce/fineweb_deduplicated

TL;DR: Fineweb is a popular, high-quality open dataset. This dataset is a deduplicated version of Fineweb: rows with duplicate text are removed, and the duplicate counts are collected.

Motivation: Fineweb is an open text dataset intended for training lan...
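
The deduplication described above (drop rows whose text repeats, keep a count of occurrences) can be illustrated with a minimal sketch. This is not Salesforce's actual pipeline: it assumes exact string matching, and the field names `text` and `count` as well as the helper `deduplicate_rows` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def deduplicate_rows(rows):
    """Keep the first row for each distinct text and attach its occurrence count.

    `rows` is a list of dicts with a "text" key (an assumed schema).
    """
    # First pass: count how many times each exact text string appears.
    counts = Counter(row["text"] for row in rows)

    seen = set()
    unique_rows = []
    # Second pass: keep only the first occurrence, preserving input order.
    for row in rows:
        text = row["text"]
        if text in seen:
            continue
        seen.add(text)
        unique_rows.append({**row, "count": counts[text]})
    return unique_rows


sample = [
    {"text": "hello world"},
    {"text": "foo bar"},
    {"text": "hello world"},
]
deduped = deduplicate_rows(sample)
```

Here `deduped` holds two rows: the repeated text appears once with a count of 2, and the unique text with a count of 1. At the scale of a web corpus, an in-memory set would likely be replaced by hashing and a distributed group-by, but the idea is the same.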

Details

Task: General NLP
Language: English
Format: Parquet
Rows / instances: N/A
Creator: Salesforce
Year: 2024