
WebOrganizer/Corpus-200B

General NLP · EN

Created by WebOrganizer in 2025, WebOrganizer/Corpus-200B is an English general NLP dataset distributed in Parquet format.

About WebOrganizer/Corpus-200B

This dataset is a pre-processed version of the 1b-1x CommonCrawl pool from DataComp-LM, cleaned with (1) RefinedWeb filters and (2) BFF deduplication. We provide the resulting 200B token corpu...

Details

Task: General NLP
Language: EN
Format: Parquet
Rows / instances: N/A
Creator: WebOrganizer
Year: 2025