Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview
arXiv (Cornell University), 2020
Alena Butryna, Shan Hui Cathy Chu, İşin Demirşahin, Alexander Gutkin, Linne Ha, Fei He, Martin Jansche, Cibu Johny, Anna Katanova, Oddur Kjartansson, Chen Fang Li, Tatiana Merkulova, Yin May Oo, Knot Pipatsrisawat, Clara E. Rivera, Supheakmungkol Sarin, Pasindu De Silva, Keshan Sodimana, Richard Sproat, Theeraphol Wattanavekin, Jaka Aris Eko Wibawa
Abstract
This paper presents an overview of a program designed to address the growing need for developing freely available speech resources for under-represented languages. At present we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.
Related Papers
- Developing Text Resources for Ten South African Languages (2014)
- Speech Technology for Information Access: a South African Case Study (2010), cited by 28
- Global Open Resources and Information for Language and Linguistic Analysis (GORILLA) (2016)
- Philippine language resources (2009), cited by 6
- On Documenting Low Resourced Indian Languages Insights from Kanauji Speech Corpus (2017), cited by 3