Refining semi-automatic parallel corpus creation for Zulu to English statistical machine translation
Abstract
Although their value in training high-quality machine translation systems has been demonstrated, parallel corpora (large collections of translated texts) are generally hard to come by for the majority of languages. To compensate for this scarcity, a relatively small collection may be processed in more depth by further cleaning and by more accurately splitting and aligning the texts. We apply this approach to an existing English/Zulu parallel corpus that has been used for statistical machine translation experiments. After these preprocessing steps, we rerun the same experiments for comparison. Our results suggest that the compatibility of bitexts, the choice of sentence splitters used on different parts of the text, and manual work may have a notable effect on both corpus size and automatic translation quality.