Evaluating Modeling Units and Sub-word Features in Language Models for Turkish ASR
Abstract
Turkish is a morphologically rich language, which leads to serious data sparsity problems in language modeling for automatic speech recognition (ASR). Using sub-words as language modeling units and incorporating sub-word features into word-level models are two strategies for alleviating the problem. In this paper, we propose a novel model architecture that incorporates sub-word features directly, using a CNN to learn (sub)word embeddings as sub-word features from character-level or sub-word-level input. We evaluate the proposed model on a Turkish ASR task, using words and morphs (sub-words) in turn as the language modeling unit. Results show that consistency between the language modeling units and the ASR system units is important for the effectiveness of rescoring. The proposed method reduces the word error rate (WER) of the word-level and statistical sub-word-level systems by 1.56% and 1.87% absolute, respectively.
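As a rough illustration of how a CNN can turn character-level input into a fixed-size (sub)word embedding, the sketch below convolves a few filters over a token's characters and keeps the maximum activation of each filter. It is not the architecture proposed in the paper: the function names, filter width, dimensions, random (untrained) weights, and the example segmentation of "evlerimizden" are all assumptions made for illustration.

```python
import math
import random


def make_params(alphabet, char_dim, n_filters, width, seed=0):
    """Create random character vectors and convolution filters (untrained, hypothetical)."""
    rng = random.Random(seed)
    char_vecs = {c: [rng.uniform(-1, 1) for _ in range(char_dim)] for c in alphabet}
    char_vecs["<pad>"] = [0.0] * char_dim  # zero vector used for padding
    filters = [
        [[rng.uniform(-1, 1) for _ in range(char_dim)] for _ in range(width)]
        for _ in range(n_filters)
    ]
    return char_vecs, filters


def char_cnn_embedding(token, char_vecs, filters):
    """Embed one token: convolve over its characters, then max-pool over positions."""
    width = len(filters[0])  # filter width in characters
    pad = ["<pad>"] * (width - 1)
    rows = [char_vecs[c] for c in pad + list(token) + pad]
    embedding = []
    for filt in filters:  # each filter is a width x char_dim weight matrix
        best = -math.inf
        for start in range(len(rows) - width + 1):
            window = rows[start:start + width]
            act = sum(w * x for f_row, row in zip(filt, window) for w, x in zip(f_row, row))
            best = max(best, math.tanh(act))  # keep the strongest response
        embedding.append(best)
    return embedding


# Illustrative Turkish example: "evlerimizden" ("from our houses"),
# hand-segmented here as ev + ler + imiz + den (assumed segmentation).
word = "evlerimizden"
morphs = ["ev", "ler", "imiz", "den"]

char_vecs, filters = make_params(sorted(set(word)), char_dim=5, n_filters=4, width=3)

word_vec = char_cnn_embedding(word, char_vecs, filters)  # word-level unit
morph_vecs = [char_cnn_embedding(m, char_vecs, filters) for m in morphs]  # morph-level units

print(len(word_vec), [len(v) for v in morph_vecs])
```

Because the embedding is built from characters, a word form that is rare or unseen in the training text still receives a vector that shares parameters with related forms, which is the motivation for sub-word features under data sparsity. In a trained model the character vectors and filters would be learned jointly with the language model rather than sampled at random.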