HintedBT: Augmenting Back-Translation with Quality and Transliteration Hints

Sahana Ramnath, Melvin Johnson, Abhirut Gupta, Aravindan Raghuveer


Abstract
Back-translation (BT) of target monolingual corpora is a widely used data augmentation strategy for neural machine translation (NMT), especially for low-resource language pairs. To improve the effectiveness of the available BT data, we introduce HintedBT—a family of techniques that provides hints (through tags) to the encoder and decoder. First, we propose a novel method of using both high- and low-quality BT data by providing hints (as source tags on the encoder) to the model about the quality of each source-target pair. Rather than filtering out low-quality data, we show that these hints enable the model to learn effectively from noisy data. Second, we address the problem of predicting whether a source token needs to be translated or transliterated into the target language, which is common in cross-script translation tasks (i.e., where source and target do not share the written script). For such cases, we propose training the model with additional hints (as target tags on the decoder) that indicate the operation required on the source (translation, or both translation and transliteration). We conduct experiments and detailed analyses on standard WMT benchmarks for three cross-script low/medium-resource language pairs: Hindi-, Gujarati-, and Tamil-to-English. Our methods compare favorably with five strong, well-established baselines. We show that using these hints, both separately and together, significantly improves translation quality and leads to state-of-the-art performance in all three language pairs in the corresponding bilingual settings.
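The hinting mechanism described in the abstract amounts to prepending special tag tokens to the training pairs: a quality-bin tag on the source side (seen by the encoder) and an operation tag on the target side (predicted by the decoder). The sketch below illustrates this preprocessing step; the tag names, the binning, and the function signature are illustrative assumptions, not the paper's exact implementation.

```python
def add_hints(src, tgt, quality_bin=None, needs_transliteration=None):
    """Attach HintedBT-style tags to a back-translated pair (illustrative sketch).

    quality_bin: integer bucket derived from some quality score for the
        BT pair (assumed binning); realized as a source-side tag.
    needs_transliteration: whether the pair requires transliteration in
        addition to translation; realized as a target-side tag.
    Tag spellings like "<q2>" and "<translate>" are hypothetical.
    """
    if quality_bin is not None:
        src = f"<q{quality_bin}> {src}"
    if needs_transliteration is not None:
        tag = "<trans+translit>" if needs_transliteration else "<translate>"
        tgt = f"{tag} {tgt}"
    return src, tgt


# Example: a noisy BT pair placed in quality bin 2, translation only.
pair = add_hints("namaste duniya", "hello world",
                 quality_bin=2, needs_transliteration=False)
```

At training time the tagged pairs are fed to a standard NMT model unchanged; at inference, the source tag can be fixed to the highest-quality bin so the model decodes in its "clean data" mode.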
Anthology ID:
2021.emnlp-main.129
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1717–1733
URL:
https://preview.aclanthology.org/build-pipeline-with-new-library/2021.emnlp-main.129/
DOI:
10.18653/v1/2021.emnlp-main.129
Cite (ACL):
Sahana Ramnath, Melvin Johnson, Abhirut Gupta, and Aravindan Raghuveer. 2021. HintedBT: Augmenting Back-Translation with Quality and Transliteration Hints. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1717–1733, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
HintedBT: Augmenting Back-Translation with Quality and Transliteration Hints (Ramnath et al., EMNLP 2021)
PDF:
https://preview.aclanthology.org/build-pipeline-with-new-library/2021.emnlp-main.129.pdf
Video:
https://preview.aclanthology.org/build-pipeline-with-new-library/2021.emnlp-main.129.mp4