statmt/cc100

Text Generation, Fill Mask | AF, AM, AR

The statmt/cc100 dataset is a text generation and fill-mask resource from statmt, published in 2022, with listed languages AF, AM, and AR.

About statmt/cc100

This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). This was constructed using the urls and pa...
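
The *_rom naming convention mentioned above can be checked programmatically. The following is a minimal sketch, not part of the dataset's own tooling: the helper names is_romanized and split_codes are hypothetical, and the code "hi_rom" is used only as an illustrative romanized entry alongside the AF, AM, AR codes listed on this page.

```python
# Hypothetical helpers illustrating the "*_rom" convention for romanized languages.
ROMANIZED_SUFFIX = "_rom"


def is_romanized(code: str) -> bool:
    """Return True if a language code follows the *_rom convention."""
    return code.lower().endswith(ROMANIZED_SUFFIX)


def split_codes(codes):
    """Split language codes into (native-script, romanized) lists, preserving order."""
    native, romanized = [], []
    for code in codes:
        if is_romanized(code):
            romanized.append(code)
        else:
            native.append(code)
    return native, romanized


if __name__ == "__main__":
    # "hi_rom" is an assumed example code, used here only for illustration.
    native, romanized = split_codes(["AF", "AM", "AR", "hi_rom"])
    print("native:", native)
    print("romanized:", romanized)
```

A split like this may be useful when romanized and native-script text need to be handled separately, for example when choosing a tokenizer.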

Details

Task: Text Generation, Fill Mask
Language: AF, AM, AR
Format: Parquet
Rows / instances: N/A
Creator: statmt
Year: 2022