
Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6097–6101


Hao Zhang, You-Chi Cheng, Shankar Kumar, W. Ronny Huang, Mingqing Chen, Rajiv Mathews

Abstract

Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling. A case-aware language model trained on this normalized text achieves the same perplexity as a model trained on text with gold capitalization. In a real user A/B experiment, we demonstrate that the improvement translates to reduced prediction error rates in a virtual keyboard application. Similarly, in an ASR language model fusion experiment, we show reduction in uppercase character error rate and word error rate.
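
To make the task definition concrete, the following is a minimal sketch of truecasing framed as per-character case labeling. It illustrates only the input and output of the task, not the hierarchical RNN proposed in the paper. The function names `case_labels` and `apply_case_labels` and the `U`/`L` label scheme are assumptions made for this illustration; in a real system the labels would come from a trained model rather than from gold text.

```python
def case_labels(gold_text: str) -> str:
    """Derive one label per character from correctly cased text.

    'U' marks an uppercase character; 'L' marks everything else
    (lowercase letters, digits, spaces, punctuation).
    Hypothetical label scheme, chosen for illustration only.
    """
    return "".join("U" if ch.isupper() else "L" for ch in gold_text)


def apply_case_labels(noisy_text: str, labels: str) -> str:
    """Restore case by applying one 'U'/'L' label to each character."""
    if len(noisy_text) != len(labels):
        raise ValueError("need exactly one label per character")
    restored = []
    for ch, label in zip(noisy_text, labels):
        # Characters without case (spaces, digits) are unchanged by upper()/lower().
        restored.append(ch.upper() if label == "U" else ch.lower())
    return "".join(restored)


if __name__ == "__main__":
    gold = "I visited New York"
    noisy = gold.lower()            # simulated noisy, all-lowercase input
    labels = case_labels(gold)      # a truecaser would predict these
    print(apply_case_labels(noisy, labels))
```

Under this framing, a truecaser is any model that predicts the label string from the noisy text alone; labeling per character rather than per word is what allows mixed-case words such as "iPhone" to be restored.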
