legacy-datasets/mc4
Text Generation, Fill Mask | AF, AM, AR
legacy-datasets/mc4 is a dataset for text generation and fill-mask tasks, listed with the languages AF, AM, and AR (Afrikaans, Amharic, and Arabic) and distributed in Parquet format.
About legacy-datasets/mc4
A colossal, cleaned version of Common Crawl's web crawl corpus.
Based on the Common Crawl dataset: "https://commoncrawl.org".
This is AllenAI's processed version of Google's mC4 dataset.
Details
- Task: Text Generation, Fill Mask
- Language: AF, AM, AR
- Format: Parquet
- Rows / instances: N/A
- Creator: legacy-datasets
- Year: 2022