PleIAs/common_corpus
General NLP · EN, FR, DE
Created by PleIAs in 2024, PleIAs/common_corpus is a General NLP dataset in EN, FR, and DE, distributed in Parquet format.
About PleIAs/common_corpus
Common Corpus
Full paper - ICLR 2026 oral
Common Corpus is the largest open licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, g...
Details
- Task: General NLP
- Language: EN, FR, DE
- Format: Parquet
- Rows / instances: N/A
- Creator: PleIAs
- Year: 2024
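Because the corpus runs to trillions of tokens, reading it lazily is usually more practical than downloading it whole. The following is a minimal sketch, not an official loading recipe. It assumes the Hugging Face `datasets` library is installed, that the repository is reachable over the network, and that it exposes a `train` split. The helper names `first_records`, `stream_common_corpus`, and `main` are hypothetical and exist only for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def first_records(records: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return at most n records from any iterable, leaving the rest unread."""
    return list(islice(records, n))


def stream_common_corpus() -> Iterable[Dict[str, Any]]:
    """Open the dataset lazily instead of downloading every Parquet file."""
    # Imported here so first_records() stays usable without the library.
    from datasets import load_dataset

    # Assumption: the repository exposes a "train" split.
    return load_dataset("PleIAs/common_corpus", split="train", streaming=True)


def main() -> None:
    for record in first_records(stream_common_corpus(), 3):
        # Print only the field names; the schema is not described in this listing.
        print(sorted(record.keys()))
```

Calling `main()` prints the field names of the first three records. Printing the keys rather than hard-coding column names avoids guessing at a schema that this listing does not state.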