Abstract
We explore best practices for training small, memory-efficient machine translation models with sequence-level knowledge distillation in the domain adaptation setting. While both domain adaptation and knowledge distillation are widely used, their interaction remains little understood. Our large-scale empirical results in machine translation (on three language pairs with three domains each) suggest distilling twice for best performance: once using general-domain data and again using in-domain data with an adapted teacher.
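The "distill, adapt, distill" recipe described above can be sketched in code. The snippet below is a minimal illustration using the Hugging Face transformers library and PyTorch, which are not necessarily the tools used in the paper; the Helsinki-NLP/opus-mt-de-en checkpoint, the toy sentences, and all hyperparameters are placeholder assumptions. Sequence-level distillation here means the teacher's beam-search translations replace the reference targets when training the student.

```python
# Minimal sketch of distill -> adapt -> distill (illustrative assumptions only;
# checkpoint, data, and hyperparameters are not the paper's actual setup).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")  # assumed teacher checkpoint


def distill_targets(teacher, src_sentences, num_beams=5, max_new_tokens=64):
    """Sequence-level KD: the teacher's beam-search outputs become the
    student's training targets, replacing the human references."""
    teacher.eval()
    batch = tok(src_sentences, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        out = teacher.generate(**batch, num_beams=num_beams, max_new_tokens=max_new_tokens)
    return tok.batch_decode(out, skip_special_tokens=True)


def train_seq2seq(model, pairs, epochs=1, lr=5e-5):
    """Plain cross-entropy training on (source, target) pairs; used both to
    adapt the teacher on in-domain data and to train/continue the student."""
    model.train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for src, tgt in pairs:
            batch = tok([src], text_target=[tgt], return_tensors="pt").to(device)
            loss = model(**batch).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model


# Toy stand-ins for the corpora (real corpora contain millions of pairs).
general_src = ["Das ist ein Test.", "Wie geht es dir?"]
in_domain_pairs = [("Der Patient erhielt 5 mg.", "The patient received 5 mg.")]

teacher = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en").to(device)
# In practice the student is a much smaller architecture; the same checkpoint
# is reused here only to keep the sketch self-contained.
student = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en").to(device)

# 1) Distill on general-domain data with the general-domain teacher.
general_pairs = list(zip(general_src, distill_targets(teacher, general_src)))
student = train_seq2seq(student, general_pairs)

# 2) Adapt the teacher to the target domain on in-domain parallel data.
teacher = train_seq2seq(teacher, in_domain_pairs)

# 3) Distill again: the adapted teacher re-labels the in-domain source side.
in_domain_src = [s for s, _ in in_domain_pairs]
adapted_pairs = list(zip(in_domain_src, distill_targets(teacher, in_domain_src)))
student = train_seq2seq(student, adapted_pairs)
```

The key point of the second stage is that the student trains on the adapted teacher's translations of in-domain source text rather than only on in-domain references, which is what the paper's second distillation step adds.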
- Anthology ID: 2020.ngt-1.12
- Volume: Proceedings of the Fourth Workshop on Neural Generation and Translation
- Month: July
- Year: 2020
- Address: Online
- Venue: NGT
- Publisher: Association for Computational Linguistics
- Pages: 110–118
- URL: https://aclanthology.org/2020.ngt-1.12
- DOI: 10.18653/v1/2020.ngt-1.12
- Cite (ACL): Mitchell Gordon and Kevin Duh. 2020. Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation. In Proceedings of the Fourth Workshop on Neural Generation and Translation, pages 110–118, Online. Association for Computational Linguistics.
- Cite (Informal): Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation (Gordon & Duh, NGT 2020)
- PDF: https://preview.aclanthology.org/starsem-semeval-split/2020.ngt-1.12.pdf
- Data: OpenSubtitles