PleIAs/common_corpus

General NLP | EN, FR, DE

Created by PleIAs in 2024, PleIAs/common_corpus is a general NLP dataset in English, French, and German (EN, FR, DE), distributed in Parquet format.

About PleIAs/common_corpus

Common Corpus (full paper: ICLR 2026 oral). Common Corpus is the largest open-licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, g...

Details

Task: General NLP
Language: EN, FR, DE
Format: Parquet
Rows / instances: N/A
Creator: PleIAs
Year: 2024
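
Because the corpus is in the trillions of tokens, downloading it whole just to inspect a few records is rarely practical. The following is a minimal sketch of peeking at the first rows by streaming, assuming the Hugging Face `datasets` library is installed and that the repository exposes a split named `train` (the split name is an assumption, not stated on this page). The helper name `peek_rows` is hypothetical.

```python
from itertools import islice

# Repository name as given on this page.
DATASET_ID = "PleIAs/common_corpus"


def peek_rows(n=3, split="train"):
    """Return the first n rows by streaming, without a full download.

    The default split name "train" is an assumption; adjust it if the
    repository uses a different layout.
    """
    # Imported lazily so the module loads even if `datasets` is absent.
    from datasets import load_dataset

    # streaming=True yields rows on demand instead of fetching every file.
    stream = load_dataset(DATASET_ID, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek_rows(2)` requires network access and returns a list of dictionaries, one per row; printing `rows[0].keys()` is a quick way to discover the actual column names.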