Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages
Top 10% of 2022 papers
Abstract
Scaling multilingual representation learning beyond the hundred most frequent languages is challenging, in particular to cover the long tail of low-resource languages. We move away from the popular one-for-all multilingual models and focus on training multiple language (family) specific representations, but most prominently enable all languages to still be encoded in the same representational space. We focus on teacher-student training, allowing all encoders to be mutually compatible for bitext mining, and enabling fast learning of new languages. We also combine supervised and self-supervised training, allowing encoders to take advantage of monolingual training data. Our approach significantly outperforms the original LASER encoder. We study very low-resource languages and handle 44 African languages, many of which are not covered by any other model. For these languages, we train sentence encoders and mine bitexts. Adding these mined bitexts yielded an improvement of 3.8 BLEU for NMT into English.