DENTRA: Denoising and Translation Pre-training for Multilingual Machine Translation

Samta Kamboj, Sunil Kumar Sahu, Neha Sengupta


Abstract
In this paper, we describe our submission to the WMT-2022: Large-Scale Machine Translation Evaluation for African Languages under the Constrained Translation track. We introduce DENTRA, a novel pre-training strategy for a multilingual sequence-to-sequence transformer model. DENTRA pre-training combines denoising and translation objectives to incorporate both monolingual and bitext corpora in 24 African, English, and French languages. To evaluate the quality of DENTRA, we fine-tuned it with two multilingual machine translation configurations, one-to-many and many-to-one. In both pre-training and fine-tuning, we employ only the datasets provided by the organizers. We compare DENTRA against a strong baseline, M2M-100, in different African multilingual machine translation scenarios and show gains in 3 out of 4 subtasks.
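The abstract describes combining a denoising objective (on monolingual text) with a translation objective (on bitext) in one pre-training loss. The following is a minimal toy sketch of that idea only; the function names, the noising scheme, and the mixing weight `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
def noisy(tokens, drop_every=3):
    """Toy noising: drop every `drop_every`-th token, a stand-in for the
    span corruption typically used in denoising pre-training."""
    return [t for i, t in enumerate(tokens) if (i + 1) % drop_every != 0]

def toy_loss(source, target):
    """Stand-in for a sequence-to-sequence loss: the fraction of target
    tokens absent from the source (a real model would use cross-entropy)."""
    missed = sum(1 for t in target if t not in source)
    return missed / max(len(target), 1)

def combined_step(mono_batch, bitext_batch, alpha=0.5):
    """Weighted sum of a denoising term (reconstruct monolingual text from
    its noised version) and a translation term (source/target bitext pairs).
    `alpha` balances the two objectives."""
    denoise = sum(toy_loss(noisy(s), s) for s in mono_batch) / len(mono_batch)
    translate = sum(toy_loss(src, tgt)
                    for src, tgt in bitext_batch) / len(bitext_batch)
    return alpha * denoise + (1 - alpha) * translate
```

In an actual multilingual setup, both terms would be computed by the same sequence-to-sequence transformer, so that monolingual corpora and bitext corpora jointly shape the shared encoder-decoder parameters.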
Anthology ID:
2022.wmt-1.103
Volume:
Proceedings of the Seventh Conference on Machine Translation (WMT)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
1057–1067
URL:
https://aclanthology.org/2022.wmt-1.103
Cite (ACL):
Samta Kamboj, Sunil Kumar Sahu, and Neha Sengupta. 2022. DENTRA: Denoising and Translation Pre-training for Multilingual Machine Translation. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1057–1067, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
DENTRA: Denoising and Translation Pre-training for Multilingual Machine Translation (Kamboj et al., WMT 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.wmt-1.103.pdf