Data Augmentation for Low-Resource Neural Machine Translation

Marzieh Fadaee, Arianna Bisazza, Christof Monz


Abstract
The quality of a Neural Machine Translation system depends substantially on the availability of sizable parallel corpora. For low-resource language pairs this is not the case, resulting in poor translation quality. Inspired by work in computer vision, we propose a novel data augmentation approach that targets low-frequency words by generating new sentence pairs containing rare words in new, synthetically created contexts. Experimental results on simulated low-resource settings show that our method improves translation quality by up to 2.9 BLEU points over the baseline and up to 3.2 BLEU over back-translation.
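The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea: put a rare word into an existing sentence pair so that it appears in a new context. It assumes a word alignment and a one-to-one bilingual lexicon are already available. The function names (`rare_words`, `augment_pair`), the `max_count` threshold, and the toy data are all hypothetical. The sketch also omits any check that the rare word actually fits the new context.

```python
from collections import Counter


def rare_words(corpus, max_count=1):
    """Return the words that occur at most `max_count` times in the corpus.

    `corpus` is a list of tokenized sentences. The threshold is an
    illustrative assumption, not a value taken from the paper.
    """
    counts = Counter(tok for sent in corpus for tok in sent)
    return {w for w, c in counts.items() if c <= max_count}


def augment_pair(src, tgt, alignment, position, rare_src, lexicon):
    """Build a new sentence pair with `rare_src` at `position` in the source.

    `alignment` maps source positions to target positions and `lexicon`
    maps source words to target words; both are hypothetical inputs.
    Returns None when the substitution cannot be mirrored on the target side.
    """
    if rare_src not in lexicon or position not in alignment:
        return None
    new_src = list(src)
    new_tgt = list(tgt)
    new_src[position] = rare_src
    new_tgt[alignment[position]] = lexicon[rare_src]
    return new_src, new_tgt


# Toy example: "fox" is rare, "dog" is not.
corpus = [["i", "saw", "a", "dog"], ["a", "dog", "barked"], ["the", "fox", "ran"]]
rare = rare_words(corpus)
pair = augment_pair(
    ["i", "saw", "a", "dog"],
    ["ich", "sah", "einen", "hund"],
    {0: 0, 1: 1, 2: 2, 3: 3},
    3,
    "fox",
    {"fox": "fuchs"},
)
```

In this toy case `pair` holds the new source sentence "i saw a fox" together with the target sentence "ich sah einen fuchs". A real system would also need to decide which substitutions yield fluent sentences, which this sketch leaves out.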
Anthology ID:
P17-2090
Volume:
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2017
Address:
Vancouver, Canada
Editors:
Regina Barzilay, Min-Yen Kan
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
567–573
URL:
https://aclanthology.org/P17-2090
DOI:
10.18653/v1/P17-2090
Cite (ACL):
Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data Augmentation for Low-Resource Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 567–573, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal):
Data Augmentation for Low-Resource Neural Machine Translation (Fadaee et al., ACL 2017)
PDF:
https://preview.aclanthology.org/fix-dup-bibkey/P17-2090.pdf
Presentation:
P17-2090.Presentation.pdf
Code:
marziehf/DataAugmentationNMT