Abstract
Increasing the size of parallel corpora for less-resourced language pairs is essential for machine translation (MT). To address the shortage of Chinese–Japanese parallel corpora, we propose a method that constructs a quasi-parallel corpus by inflating a small Chinese–Japanese parallel corpus, with the aim of improving statistical machine translation (SMT) quality. We generate new sentences using analogical associations based on large amounts of monolingual data and a small amount of parallel data. We filter over-generated sentences using two methods, one based on BLEU and the other on N-sequences. We add the resulting aligned quasi-parallel corpus to a small Chinese–Japanese parallel corpus and perform SMT experiments. We obtain significant improvements over a baseline system.
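The abstract does not define the N-sequence filter, so the following is only a minimal sketch under a stated assumption: a generated sentence is kept when every character N-gram it contains is attested in a reference monolingual corpus. The function names (`char_ngrams`, `build_reference_ngrams`, `filter_by_n_sequences`) and the choice of character-level units are hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Set


def char_ngrams(sentence: str, n: int) -> Set[str]:
    """Return the set of character n-grams of a sentence."""
    return {sentence[i:i + n] for i in range(len(sentence) - n + 1)}


def build_reference_ngrams(corpus: Iterable[str], n: int) -> Set[str]:
    """Collect every character n-gram attested in a reference corpus."""
    attested: Set[str] = set()
    for sentence in corpus:
        attested |= char_ngrams(sentence, n)
    return attested


def filter_by_n_sequences(
    candidates: Iterable[str], attested: Set[str], n: int
) -> List[str]:
    """Keep candidates whose n-grams are all attested (assumed criterion).

    Sentences shorter than n have no n-grams and are rejected here;
    that is a choice made for this sketch, not something the abstract states.
    """
    kept = []
    for sentence in candidates:
        grams = char_ngrams(sentence, n)
        if grams and grams <= attested:
            kept.append(sentence)
    return kept


if __name__ == "__main__":
    reference = build_reference_ngrams(["the cat sat", "the dog ran"], n=2)
    print(filter_by_n_sequences(["the cat ran", "the cow ran"], reference, n=2))
```

In this toy run, "the cow ran" is dropped because the bigrams "co" and "ow" never occur in the reference sentences. A larger N makes the filter stricter, since longer sequences are less likely to be attested.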
Original language | English |
---|---|
Pages (from-to) | 88-99 |
Number of pages | 12 |
Journal | Journal of Information Processing |
Volume | 25 |
Publication status | Published - 2017 |
Keywords
- Analogies
- BLEU
- Clustering
- Filtering
- Machine translation
- Quasi-parallel corpus
ASJC Scopus subject areas
- Computer Science (all)