Abstract
Low-resource machine translation (LRMT) poses a substantial challenge due to the scarcity of parallel training data. This paper introduces a new method to improve the transfer of the embedding layer from a high-resource Parent model to a low-resource Child model in LRMT, utilizing the trained token embeddings of the Parent model’s high-resource vocabulary. Our approach projects all tokens into a shared semantic space and measures the semantic similarity between tokens in the low-resource and high-resource languages. These similarity measures are then used to initialize the token representations in the Child model’s low-resource vocabulary. We evaluate our approach on three benchmark datasets of low-resource language pairs: Myanmar-English, Indonesian-English, and Turkish-English. The experimental results demonstrate that our method outperforms previous methods in translation quality. Additionally, our approach is computationally efficient, reducing training time compared to prior work.
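The core idea described above is that each low-resource token embedding is initialized from the trained embeddings of semantically similar high-resource tokens. Below is a minimal NumPy sketch of that general recipe, not the authors’ exact procedure: the shared-space projections are assumed to be given, and the top-k cutoff and softmax weighting are illustrative assumptions.

```python
import numpy as np

def init_child_embeddings(parent_emb, parent_shared, child_shared, top_k=10):
    """Sketch of similarity-based embedding transfer (illustrative, not the paper's exact method).

    parent_emb:    (P, d) trained embedding table of the Parent (high-resource) model
    parent_shared: (P, s) Parent-vocabulary tokens projected into a shared semantic space
    child_shared:  (C, s) Child-vocabulary (low-resource) tokens in the same shared space
    Returns a (C, d) initialization for the Child model's embedding layer.
    """
    # Cosine similarity between every child token and every parent token
    # in the shared semantic space.
    p = parent_shared / np.linalg.norm(parent_shared, axis=1, keepdims=True)
    c = child_shared / np.linalg.norm(child_shared, axis=1, keepdims=True)
    sim = c @ p.T  # (C, P)

    child_emb = np.zeros((child_shared.shape[0], parent_emb.shape[1]))
    for i in range(sim.shape[0]):
        # Keep only the top-k most similar parent tokens for this child token.
        idx = np.argsort(sim[i])[-top_k:]
        # Softmax over the retained similarities (assumed weighting scheme).
        w = np.exp(sim[i, idx])
        w /= w.sum()
        # Child token i starts as a similarity-weighted mix of parent embeddings.
        child_emb[i] = w @ parent_emb[idx]
    return child_emb
```

In such a setup, the resulting table would replace random initialization of the Child model’s embedding layer before fine-tuning on the low-resource pair, while the rest of the Parent model’s parameters are transferred directly.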
- Anthology ID: 2023.mtsummit-research.11
- Volume: Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
- Month: September
- Year: 2023
- Address: Macau SAR, China
- Editors: Masao Utiyama, Rui Wang
- Venue: MTSummit
- Publisher: Asia-Pacific Association for Machine Translation
- Pages: 123–134
- URL: https://aclanthology.org/2023.mtsummit-research.11
- Cite (ACL): Van Hien Tran, Chenchen Ding, Hideki Tanaka, and Masao Utiyama. 2023. Improving Embedding Transfer for Low-Resource Machine Translation. In Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track, pages 123–134, Macau SAR, China. Asia-Pacific Association for Machine Translation.
- Cite (Informal): Improving Embedding Transfer for Low-Resource Machine Translation (Tran et al., MTSummit 2023)
- PDF: https://preview.aclanthology.org/proper-vol2-ingestion/2023.mtsummit-research.11.pdf