Hubert: How Much Can a Bad Teacher Benefit ASR Pre-Training?
Abstract
Compared to vision and language applications, self-supervised pre-training approaches for ASR are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) with audio-only pre-training, there is no lexicon of sound units, and (3) sound units have variable lengths with no explicit segmentation. In this paper, we propose the Hidden-Unit BERT (HUBERT) model, which utilizes a cheap k-means clustering step to provide aligned target labels for pre-training of a BERT model. A key ingredient of our approach is applying the predictive loss over the masked regions only. This allows the pre-training stage to benefit from the consistency of the unsupervised teacher rather than its intrinsic quality. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HUBERT model matches the state-of-the-art wav2vec 2.0 performance on the ultra-low-resource Libri-light 10h, 1h, and 10min supervised subsets.
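The two ingredients described above can be sketched in a few lines: a k-means step that assigns each acoustic frame a discrete pseudo-label, and a predictive loss computed only over masked frames. The following is a minimal NumPy illustration of those ideas, not the paper's implementation; the function names and toy dimensions are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_labels(feats, k, iters=10):
    """Toy k-means over frame-level features -> discrete pseudo-labels.

    feats: (T, D) array of per-frame acoustic features.
    Returns a (T,) array of cluster indices in [0, k).
    """
    centroids = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every frame to every centroid.
        d = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = feats[labels == c]
            if len(members):  # leave empty clusters untouched
                centroids[c] = members.mean(axis=0)
    return labels

def masked_prediction_loss(logits, labels, mask):
    """Cross-entropy against pseudo-labels, over masked frames ONLY.

    logits: (T, k) model predictions; labels: (T,) k-means targets;
    mask: (T,) boolean, True where the input frame was masked.
    """
    masked = np.flatnonzero(mask)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -logp[masked, labels[masked]].mean()
```

Restricting the loss to masked positions is what lets a noisy teacher help: on unmasked frames the model could trivially copy the (possibly wrong) cluster label, whereas predicting the label of a masked frame from context forces the model to learn acoustic and temporal structure.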