Question 1

What is the statmt/cc100 dataset?

Accepted Answer

This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). This was constructed using the URLs and pa...

Question 2

Is statmt/cc100 a benchmark?

Accepted Answer

statmt/cc100 is a dataset for training or evaluation; it isn't tracked as a standard LLM benchmark in our catalog.

Question 3

Where can I download statmt/cc100?

Accepted Answer

statmt/cc100 is available at its source: https://huggingface.co/datasets/statmt/cc100.
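
As a rough illustration of reading a few records from that source, here is a minimal sketch using the Hugging Face `datasets` library. Several things here are assumptions rather than facts from this page: that the language code (for example "en") is accepted as the configuration name, that each record has a "text" field, and that your installed `datasets` version can load this repository. The helper names `dataset_url` and `stream_samples` are hypothetical.

```python
from itertools import islice

DATASET_ID = "statmt/cc100"


def dataset_url(repo_id: str) -> str:
    """Build the Hugging Face dataset page URL for a repository id."""
    return f"https://huggingface.co/datasets/{repo_id}"


def stream_samples(lang: str = "en", n: int = 3) -> list:
    """Return the first n text records for one language without a full download.

    Assumptions: `lang` is a valid configuration name for this dataset and
    each record exposes a "text" field.
    """
    # Imported lazily so the URL helper works without the third-party package.
    from datasets import load_dataset

    # streaming=True iterates over records instead of downloading the whole split.
    ds = load_dataset(DATASET_ID, lang, split="train", streaming=True)
    return [row["text"] for row in islice(ds, n)]


if __name__ == "__main__":
    print(dataset_url(DATASET_ID))
```

Streaming is used in the sketch because the per-language files can be large; calling stream_samples("en") requires network access and the `datasets` package.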
