Question 1

What is the legacy-datasets/common_voice dataset?

Accepted Answer

Common Voice is Mozilla's initiative to help teach machines how real people speak.
According to the dataset's own description, it consists of 7,335 validated hours of speech in 60 languages, and Mozilla continues to add more voices and languages.

Question 2

Is legacy-datasets/common_voice a benchmark?

Accepted Answer

legacy-datasets/common_voice is a dataset for training or evaluation; it isn't tracked as a standard LLM benchmark in our catalog.

Question 3

Where can I download legacy-datasets/common_voice?

Accepted Answer

legacy-datasets/common_voice is available at its source: https://huggingface.co/datasets/legacy-datasets/common_voice.
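
As a rough illustration of fetching it programmatically, the sketch below uses the `huggingface_hub` library's `snapshot_download` function. The helper names `dataset_url` and `download_snapshot` and the `common_voice` target directory are hypothetical, chosen only for this example. It also assumes the repository hosts the files you need; a snapshot may contain only metadata or a loading script rather than the audio itself.

```python
REPO_ID = "legacy-datasets/common_voice"  # dataset identifier from the answer above


def dataset_url(repo_id: str = REPO_ID) -> str:
    """Build the Hugging Face Hub page URL for a dataset repository."""
    return f"https://huggingface.co/datasets/{repo_id}"


def download_snapshot(repo_id: str = REPO_ID, local_dir: str = "common_voice") -> str:
    """Download the repository's files into local_dir (needs network access).

    local_dir is a hypothetical target folder used for illustration.
    """
    # Imported lazily so dataset_url() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # repo_type="dataset" is required because the default repo type is "model".
    return snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)


if __name__ == "__main__":
    print(dataset_url())
```

Calling `download_snapshot()` returns the local path of the downloaded folder. It is kept out of the `__main__` block so that running the file does not start a download.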