Abstract
Neural Machine Translation (NMT) models are trained on parallel corpora with unbalanced word frequency distribution. As a result, NMT models are likely to prefer high-frequency words than low-frequency ones despite low-frequency word may carry the crucial semantic information, which may hamper the translation quality once they are neglected. The objective of this study is to enhance the translation of meaningful but low-frequency words. Our general idea is to optimize the translation of low-frequency words through knowledge distillation. Specifically, we employ a low-frequency teacher model that excels in translating low-frequency words to guide the learning of the student model. To remain the translation quality of high-frequency words, we further introduce a dual-teacher distillation framework, leveraging both the low-frequency and high-frequency teacher models to guide the student model’s training. Our single-teacher distillation method already achieves a +0.64 BLEU improvements over the state-of-the-art method on the WMT 16 English-to-German translation task on the low-frequency test set. While our dual-teacher framework leads to +0.87, +1.24, +0.47, +0.87 and +0.86 BLEU improvements on the IWSLT 14 German-to-English, WMT 16 English-to-German, WMT 15 English-to-Czech, WMT 14 English-to-French and WMT 18 Chinese-to-English tasks respectively compared to the baseline, while maintaining the translation performance of high-frequency words.- Anthology ID:
- 2024.findings-emnlp.317
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2024
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 5543–5552
- Language:
- URL:
- https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.317/
- DOI:
- 10.18653/v1/2024.findings-emnlp.317
- Cite (ACL):
- Yifan Guo, Hongying Zan, and Hongfei Xu. 2024. Dual-teacher Knowledge Distillation for Low-frequency Word Translation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5543–5552, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- Dual-teacher Knowledge Distillation for Low-frequency Word Translation (Guo et al., Findings 2024)
- PDF:
- https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.317.pdf