Data Augmentation for Low-Resource Neural Machine Translation

Marzieh Fadaee; Arianna Bisazza; Christof Monz

doi:10.18653/v1/P17-2090

Abstract

The quality of a Neural Machine Translation system depends substantially on the availability of sizable parallel corpora. For low-resource language pairs this is not the case, resulting in poor translation quality. Inspired by work in computer vision, we propose a novel data augmentation approach that targets low-frequency words by generating new sentence pairs containing rare words in new, synthetically created contexts. Experimental results on simulated low-resource settings show that our method improves translation quality by up to 2.9 BLEU points over the baseline and up to 3.2 BLEU over back-translation.
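The core idea in the abstract, placing a rare word into a new context and producing a matching sentence pair, can be illustrated with a toy sketch. The abstract does not say how substitution positions are chosen or how the target side is updated, so everything below is an assumption made for illustration: the word alignment, the `translations` table, the `candidates` map (a stand-in for whatever model judges that a rare word fits a context), and the function names `rare_words` and `augment` are all hypothetical and are not taken from the paper or its released code.

```python
from collections import Counter


def rare_words(corpus, max_count):
    """Return source-side words that occur at most max_count times."""
    counts = Counter(tok for src, _tgt, _align in corpus for tok in src)
    return {word for word, count in counts.items() if count <= max_count}


def augment(corpus, translations, candidates, max_count=1):
    """Create new sentence pairs that place rare words in existing contexts.

    corpus:       list of (src_tokens, tgt_tokens, alignment), where
                  alignment maps a source index to a target index (assumed given)
    translations: hypothetical table, source word -> target word
    candidates:   hypothetical map, source word -> words that may replace it
    """
    rare = rare_words(corpus, max_count)
    new_pairs = []
    for src, tgt, align in corpus:
        for i, word in enumerate(src):
            if i not in align:
                continue  # no aligned target word to update, so skip
            for sub in sorted(candidates.get(word, ())):
                if sub == word or sub not in rare or sub not in translations:
                    continue
                j = align[i]
                # Swap the source word and its aligned target word together.
                new_src = src[:i] + [sub] + src[i + 1:]
                new_tgt = tgt[:j] + [translations[sub]] + tgt[j + 1:]
                new_pairs.append((new_src, new_tgt))
    return new_pairs


# Toy English-German data, invented for this example.
corpus = [
    (["i", "bought", "a", "book"], ["ich", "kaufte", "ein", "buch"],
     {0: 0, 1: 1, 2: 2, 3: 3}),
    (["the", "lexicon", "is", "old"], ["das", "lexikon", "ist", "alt"],
     {0: 0, 1: 1, 2: 2, 3: 3}),
]
translations = {"lexicon": "lexikon"}
candidates = {"book": {"lexicon"}}

new_pairs = augment(corpus, translations, candidates)
print(new_pairs)
```

In this toy run the rare word "lexicon" is placed into the context of the first sentence, and the aligned target word is replaced with its table translation, so both sides of the new pair stay parallel. A real system would need a principled way to propose fluent substitutions and to handle agreement on the target side, which this sketch ignores.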

Anthology ID: P17-2090
Volume: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month: July
Year: 2017
Address: Vancouver, Canada
Editors: Regina Barzilay, Min-Yen Kan
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 567–573
URL: https://aclanthology.org/P17-2090
DOI: 10.18653/v1/P17-2090
Cite (ACL): Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data Augmentation for Low-Resource Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 567–573, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal): Data Augmentation for Low-Resource Neural Machine Translation (Fadaee et al., ACL 2017)
PDF: https://preview.aclanthology.org/fix-dup-bibkey/P17-2090.pdf
Presentation: P17-2090.Presentation.pdf
Code: marziehf/DataAugmentationNMT
