WebOrganizer/Corpus-200B
General NLP, EN
Created by WebOrganizer in 2025, WebOrganizer/Corpus-200B is a General NLP dataset in English, distributed in Parquet format.
About WebOrganizer/Corpus-200B
[Paper] [Website] [GitHub]
This dataset is a pre-processed version of the 1b-1x CommonCrawl pool from DataComp-LM, cleaned with
(1) RefinedWeb filters and
(2) BFF deduplication.
We provide the resulting 200B token corpu...
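The two cleaning steps described above (heuristic filtering followed by deduplication) can be illustrated with a toy sketch. This is not the pipeline used to build the corpus: it assumes that BFF refers to a Bloom-filter-based deduplicator, and it swaps in a trivial word-count rule for the RefinedWeb filters. The names `BloomFilter`, `passes_quality_filter`, and `clean`, as well as all thresholds, are hypothetical and chosen only for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over strings (illustrative only, not the real BFF code)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive several bit positions by salting a SHA-256 hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def passes_quality_filter(text: str, min_words: int = 5) -> bool:
    # Hypothetical stand-in for heuristic quality filters:
    # keep only documents with at least `min_words` words.
    return len(text.split()) >= min_words


def clean(docs):
    """Apply the toy filter, then drop documents whose normalized text was seen."""
    seen = BloomFilter()
    kept = []
    for doc in docs:
        if not passes_quality_filter(doc):
            continue
        # Normalize case and whitespace so trivial variants collide.
        key = " ".join(doc.lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",  # near-identical copy
    "too short",                                      # fails the toy filter
    "A completely different document with enough words",
]
cleaned = clean(docs)
```

A Bloom filter trades a small false-positive rate (occasionally dropping a unique document) for constant memory, which is why this family of structures is attractive at web scale; an exact set of hashes would avoid false positives but grow with the corpus.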
Details
- Task: General NLP
- Language: EN
- Format: Parquet
- Rows / instances: N/A
- Creator: WebOrganizer
- Year: 2025