@inproceedings{gordon-duh-2020-distill,
title = "Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation",
author = "Gordon, Mitchell and
Duh, Kevin",
editor = "Birch, Alexandra and
Finch, Andrew and
Hayashi, Hiroaki and
Heafield, Kenneth and
Junczys-Dowmunt, Marcin and
Konstas, Ioannis and
Li, Xian and
Neubig, Graham and
Oda, Yusuke",
booktitle = "Proceedings of the Fourth Workshop on Neural Generation and Translation",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/jlcl-multiple-ingestion/2020.ngt-1.12/",
doi = "10.18653/v1/2020.ngt-1.12",
pages = "110--118",
abstract = "We explore best practices for training small, memory efficient machine translation models with sequence-level knowledge distillation in the domain adaptation setting. While both domain adaptation and knowledge distillation are widely-used, their interaction remains little understood. Our large-scale empirical results in machine translation (on three language pairs with three domains each) suggest distilling twice for best performance: once using general-domain data and again using in-domain data with an adapted teacher."
}
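
The abstract summarizes a training recipe ("distill, adapt, distill") rather than giving code. As a reading aid only, the Python sketch below illustrates the order of operations that recipe implies: train a general-domain teacher, distill it into a small student on general-domain data, adapt the teacher to the target domain, then distill again on in-domain data. Every name here (Corpus, Model, train, translate, distill_adapt_distill) is a hypothetical placeholder, not code or an API from the paper.

```python
# Hypothetical sketch of the "distill, adapt, distill" recipe described in the
# abstract. All classes and functions are illustrative stubs, not the authors'
# implementation or any specific NMT toolkit.
from dataclasses import dataclass, field


@dataclass
class Corpus:
    source: list  # source-language sentences
    target: list  # target-language sentences (references or teacher outputs)


@dataclass
class Model:
    name: str
    history: list = field(default_factory=list)  # records what it was trained on


def train(model: Model, data: Corpus, label: str) -> Model:
    """Placeholder for NMT training / continued training on `data`."""
    model.history.append(label)
    return model


def translate(model: Model, sources: list) -> list:
    """Placeholder for decoding; returns pseudo-translations of `sources`."""
    return [f"{model.name} translation of: {s}" for s in sources]


def distill_adapt_distill(general: Corpus, in_domain: Corpus) -> Model:
    # 1. Train a large general-domain teacher.
    teacher = train(Model("teacher"), general, "general-domain training")

    # 2. First distillation: train a small student on the teacher's
    #    translations of general-domain source text (sequence-level KD).
    general_kd = Corpus(general.source, translate(teacher, general.source))
    student = train(Model("student"), general_kd, "general-domain distillation")

    # 3. Adapt the teacher to the target domain with continued training.
    teacher = train(teacher, in_domain, "in-domain adaptation")

    # 4. Second distillation: continue training the student on the adapted
    #    teacher's translations of in-domain source text.
    in_domain_kd = Corpus(in_domain.source, translate(teacher, in_domain.source))
    student = train(student, in_domain_kd, "in-domain distillation")

    return student


if __name__ == "__main__":
    general = Corpus(["a general sentence"], ["its reference translation"])
    medical = Corpus(["a medical sentence"], ["its reference translation"])
    small_model = distill_adapt_distill(general, medical)
    print(small_model.history)
```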