Abstract
Large-scale parallel corpora are indispensable for training highly accurate machine translation systems. However, manually constructed large-scale parallel corpora are not freely available for many language pairs. In previous studies, training data have been expanded with a pseudo-parallel corpus obtained by machine-translating a monolingual corpus in the target language. However, for low-resource language pairs in which only low-accuracy machine translation systems are available, translation quality degrades when a pseudo-parallel corpus is used naively. To improve machine translation performance for low-resource language pairs, we propose a method that expands the training data effectively by filtering the pseudo-parallel corpus using quality estimation based on back-translation. In experiments on three language pairs using small, medium, and large parallel corpora, the language pairs with less training data had more sentence pairs filtered out and showed larger improvements in BLEU scores.
- Anthology ID:
- W17-5704
- Volume:
- Proceedings of the 4th Workshop on Asian Translation (WAT2017)
- Month:
- November
- Year:
- 2017
- Address:
- Taipei, Taiwan
- Editors:
- Toshiaki Nakazawa, Isao Goto
- Venue:
- WAT
- Publisher:
- Asian Federation of Natural Language Processing
- Pages:
- 70–78
- URL:
- https://aclanthology.org/W17-5704
- Cite (ACL):
- Aizhan Imankulova, Takayuki Sato, and Mamoru Komachi. 2017. Improving Low-Resource Neural Machine Translation with Filtered Pseudo-Parallel Corpus. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages 70–78, Taipei, Taiwan. Asian Federation of Natural Language Processing.
- Cite (Informal):
- Improving Low-Resource Neural Machine Translation with Filtered Pseudo-Parallel Corpus (Imankulova et al., WAT 2017)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/W17-5704.pdf
- Code:
- aizhanti/filtered-pseudo-parallel-corpora
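The linked repository contains the authors' implementation. As a rough illustration of the filtering idea described in the abstract, the minimal sketch below keeps pseudo-parallel sentence pairs whose round-trip back-translation scores well against the original target sentence. Sentence-level BLEU as the quality estimate is an assumption, and the `translate` callable and the `0.1` threshold are hypothetical placeholders, not the paper's exact setup.

```python
# Minimal sketch of back-translation-based filtering for a pseudo-parallel
# corpus, assuming the corpus was built by translating target-language
# monolingual sentences into the source language.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # smoothing for short sentences


def filter_pseudo_parallel(pairs, translate, threshold=0.1):
    """Keep (pseudo_source, target) pairs whose round-trip translation
    scores at least `threshold` sentence-level BLEU against the original
    target sentence. `translate` is a hypothetical source-to-target MT
    function; the threshold value is likewise illustrative."""
    kept = []
    for pseudo_source, target in pairs:
        # Translate the machine-generated source sentence back into
        # the target language.
        round_trip = translate(pseudo_source)
        # Sentence-level BLEU between the original target sentence and
        # its round-trip reconstruction serves as the quality estimate.
        score = sentence_bleu(
            [target.split()],
            round_trip.split(),
            smoothing_function=smooth,
        )
        if score >= threshold:
            kept.append((pseudo_source, target))
    return kept
```

The filtered pairs would then be concatenated with the genuine parallel corpus for training; lowering the threshold keeps more pseudo data at the cost of admitting noisier pairs.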