Common Voice: A Massively-Multilingual Speech Corpus

“Common Voice: A Massively-Multilingual Speech Corpus,” submitted to arXiv on December 13, 2019 by Rosana Ardila, Megan Branson, and colleagues at Mozilla, described a crowdsourced effort to build openly licensed speech data for many languages at once. Volunteers record short sentences and validate one another’s clips, producing a corpus that, at the time of the paper, spanned 29 languages and about 2,500 hours of audio from over 50,000 contributors.

The dataset is released into the public domain, deliberately countering the concentration of speech data inside a few large companies. As a case study, the paper applies transfer learning from an English source model, using Mozilla's DeepSpeech toolkit, and reports speech recognition results for twelve target languages. Common Voice has since grown well beyond its initial scope through ongoing contributions.

Why business readers should care: speech technology has long worked best in English and a handful of major languages because that is where the data was. Common Voice is a leading attempt to broaden that base, lowering the barrier for building voice products in underserved languages and giving researchers a freely usable benchmark.

