Abstract
Knowledge transfer can boost neural machine translation (NMT), for example, by finetuning a pretrained masked language model (LM). However, it may suffer from the forgetting problem and the structural inconsistency between pretrained LMs and NMT models. Knowledge distillation (KD) is a potential solution to alleviate these issues, but few studies have investigated language knowledge transfer from pretrained language models to NMT models through KD. In this paper, we propose Pretrained Bidirectional Distillation (PBD) for NMT, which aims to efficiently transfer bidirectional language knowledge from masked language pretraining to NMT models. Its advantages are reflected in efficiency and effectiveness through a globally defined and bidirectional context-aware distillation objective. Bidirectional language knowledge of the entire sequence is transferred to an NMT model concurrently during translation training. Specifically, we propose self-distilled masked language pretraining to obtain the PBD objective. We also design PBD losses to efficiently distill the language knowledge, in the form of token probabilities, to the encoder and decoder of an NMT model using the PBD objective. Extensive experiments reveal that pretrained bidirectional distillation can significantly improve machine translation performance and achieve competitive or even better results than previous pretrain-finetune or unified multilingual translation methods in supervised, unsupervised, and zero-shot scenarios. Empirically, it is concluded that pretrained bidirectional distillation is an effective and efficient method for transferring language knowledge from pretrained language models to NMT models.
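To make the idea of distilling token probabilities into the NMT encoder and decoder more concrete, the following is a minimal, illustrative PyTorch sketch of a token-level distillation term combined with the usual translation cross-entropy. It assumes a frozen pretrained masked LM as the teacher and an NMT encoder/decoder with a vocabulary projection head as the student; all function and variable names (pbd_loss, training_loss, lambda_pbd, etc.) are hypothetical and do not come from the paper's released code.

```python
# Hedged sketch: token-probability distillation from a pretrained LM (teacher)
# into an NMT encoder or decoder (student), applied alongside translation training.
import torch
import torch.nn.functional as F


def pbd_loss(student_logits: torch.Tensor,
             teacher_probs: torch.Tensor,
             pad_mask: torch.Tensor) -> torch.Tensor:
    """Token-level distillation: KL(teacher || student) over the whole sequence.

    student_logits: (batch, seq_len, vocab) raw scores from the NMT encoder/decoder head.
    teacher_probs:  (batch, seq_len, vocab) token probabilities from the pretrained LM.
    pad_mask:       (batch, seq_len), 1 for real tokens, 0 for padding.
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    # Per-token KL divergence, summed over the vocabulary dimension.
    kl = F.kl_div(log_p_student, teacher_probs, reduction="none").sum(-1)
    # Average over non-padding positions so every token in the sequence contributes.
    return (kl * pad_mask).sum() / pad_mask.sum().clamp(min=1)


def training_loss(ce_loss, enc_logits, dec_logits,
                  src_teacher_probs, tgt_teacher_probs,
                  src_mask, tgt_mask, lambda_pbd=0.5):
    # Hypothetical combination: translation cross-entropy plus distillation
    # terms for the encoder (source side) and decoder (target side).
    return (ce_loss
            + lambda_pbd * pbd_loss(enc_logits, src_teacher_probs, src_mask)
            + lambda_pbd * pbd_loss(dec_logits, tgt_teacher_probs, tgt_mask))
```

The weight lambda_pbd above is an assumed hyperparameter balancing translation and distillation signals; the paper should be consulted for the actual loss formulation and weighting.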
- Anthology ID:
- 2023.acl-long.63
- Volume:
- Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1132–1145
- URL:
- https://aclanthology.org/2023.acl-long.63
- DOI:
- 10.18653/v1/2023.acl-long.63
- Cite (ACL):
- Yimeng Zhuang and Mei Tu. 2023. Pretrained Bidirectional Distillation for Machine Translation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1132–1145, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- Pretrained Bidirectional Distillation for Machine Translation (Zhuang & Tu, ACL 2023)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2023.acl-long.63.pdf