PhraseOut: A Code Mixed Data Augmentation Method for Multilingual Neural Machine Translation

Binu Jasim, Vinay Namboodiri, C V Jawahar


Abstract
Data augmentation methods for Neural Machine Translation (NMT), such as back-translation (BT) and self-training (ST), are quite popular. In a multilingual NMT system, simply copying monolingual source sentences to the target side (Copying) is an effective data augmentation method. Back-translation augments parallel data by translating monolingual target-side sentences into the source language. In this work, we propose a partial back-translation method for the multilingual setting. Instead of translating the entire monolingual target sentence back into the source language, we replace only selected high-confidence phrases and keep the remaining words in the target language. We call this method PhraseOut. Our experiments on low-resource multilingual translation models show that PhraseOut gives reasonable improvements over existing data augmentation methods.
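The partial back-translation idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how phrases are selected or scored, so everything here is an assumption made for illustration: the function name `phrase_out`, the toy phrase table mapping target-language phrases to a source-language phrase with a confidence score, the `threshold` and `max_len` parameters, and the greedy longest-match strategy. It is not the authors' implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical phrase table (an assumption, not from the paper):
# target-language phrase -> (source-language phrase, confidence score).
PhraseTable = Dict[Tuple[str, ...], Tuple[Tuple[str, ...], float]]


def phrase_out(
    target_tokens: List[str],
    phrase_table: PhraseTable,
    threshold: float = 0.8,
    max_len: int = 3,
) -> List[str]:
    """Build a code-mixed pseudo-source sentence from a target sentence.

    Phrases whose table entry has confidence >= threshold are replaced by
    their source-language translation; all other words stay in the target
    language. Matching is greedy, longest phrase first (an assumption).
    """
    mixed: List[str] = []
    i = 0
    while i < len(target_tokens):
        replaced = False
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(target_tokens) - i), 0, -1):
            entry = phrase_table.get(tuple(target_tokens[i:i + n]))
            if entry is not None and entry[1] >= threshold:
                mixed.extend(entry[0])  # swap in the source-language phrase
                i += n
                replaced = True
                break
        if not replaced:
            mixed.append(target_tokens[i])  # keep the target-language word
            i += 1
    return mixed


# Toy example: English as the target language, French as the source language.
toy_table: PhraseTable = {
    ("the", "cat"): (("le", "chat"), 0.9),
    ("on",): (("sur",), 0.5),  # below the threshold, so it is not replaced
    ("the", "mat"): (("le", "tapis"), 0.85),
}
target = "the cat sat on the mat".split()
pseudo_source = phrase_out(target, toy_table)
# The augmented training pair is (code-mixed pseudo-source, original target).
augmented_pair = (" ".join(pseudo_source), " ".join(target))
```

In this sketch the resulting pair would be added to the parallel training data in the same way a fully back-translated pair would be, with the original monolingual sentence kept unchanged on the target side.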
Anthology ID:
2020.icon-main.63
Volume:
Proceedings of the 17th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2020
Address:
Indian Institute of Technology Patna, Patna, India
Editors:
Pushpak Bhattacharyya, Dipti Misra Sharma, Rajeev Sangal
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
470–474
URL:
https://aclanthology.org/2020.icon-main.63
Cite (ACL):
Binu Jasim, Vinay Namboodiri, and C V Jawahar. 2020. PhraseOut: A Code Mixed Data Augmentation Method for Multilingual Neural Machine Translation. In Proceedings of the 17th International Conference on Natural Language Processing (ICON), pages 470–474, Indian Institute of Technology Patna, Patna, India. NLP Association of India (NLPAI).
Cite (Informal):
PhraseOut: A Code Mixed Data Augmentation Method for Multilingual Neural Machine Translation (Jasim et al., ICON 2020)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2020.icon-main.63.pdf