Mining for Domain-specific Parallel Text from Wikipedia
Top 13% of 2013 papers by citations
Abstract
Previous attempts at extracting parallel data from Wikipedia were restricted by the monotonicity constraint of the alignment algorithm used to match candidate sentences. This paper proposes a method for exploiting Wikipedia articles that is independent of the position of the sentences in the text. The algorithm ranks candidate sentence pairs by a customized metric that combines several similarity criteria. Moreover, we limit the search space to a specific topical domain, since our final goal is to use the extracted data in a domain-specific Statistical Machine Translation (SMT) setting. The precision estimates show that the extracted sentence pairs are clearly semantically equivalent. The SMT experiments, however, show that the extracted data is not refined enough to improve a strong in-domain SMT system. Nevertheless, it is good enough to boost the performance of an out-of-domain system trained on sizable amounts of data.
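The ranking step described above can be sketched as follows. This is a minimal illustration, not the paper's actual metric: the two similarity criteria (length ratio and dictionary-based lexical overlap) and their weights are assumptions chosen for the example.

```python
# Sketch: rank candidate sentence pairs by a weighted combination of
# similarity criteria, without any monotonicity (position) constraint.
# The criteria and weights are illustrative assumptions.

def length_ratio(src, tgt):
    """Length similarity: ratio of the shorter to the longer token count."""
    a, b = len(src.split()), len(tgt.split())
    return min(a, b) / max(a, b)

def lexical_overlap(src, tgt, bilingual_dict):
    """Fraction of source tokens with a dictionary translation in the target."""
    tgt_tokens = set(tgt.lower().split())
    src_tokens = src.lower().split()
    if not src_tokens:
        return 0.0
    hits = sum(1 for w in src_tokens
               if any(t in tgt_tokens for t in bilingual_dict.get(w, [])))
    return hits / len(src_tokens)

def rank_pairs(candidates, bilingual_dict, w_len=0.3, w_lex=0.7):
    """Score every (src, tgt) candidate pair and return them best-first."""
    scored = [(w_len * length_ratio(s, t)
               + w_lex * lexical_overlap(s, t, bilingual_dict), s, t)
              for s, t in candidates]
    return sorted(scored, reverse=True)
```

In this sketch, any sentence from the source article can be paired with any sentence from the target article; only the combined score decides the ranking, which is what frees the method from the monotonic-alignment assumption of earlier work.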