Abstract
Multilingual neural machine translation (MNMT) jointly trains a single shared model on multiple language pairs. However, traditional subword-based MNMT approaches suffer from out-of-vocabulary (OOV) issues and a representation bottleneck, which often degrade translation performance on certain language pairs. While byte tokenization has been used to tackle the OOV problem in neural machine translation (NMT), its capability has not yet been validated in MNMT. Moreover, to our knowledge, existing work has not studied how byte encoding can benefit endangered-language translation. We propose a byte-based multilingual neural machine translation system (BMNMT) to alleviate the representation bottleneck and improve translation performance for endangered languages. Furthermore, we design a random byte mapping method with ensemble prediction to make the model more robust. Experimental results show that BMNMT consistently and significantly outperforms subword- and word-based baselines on twelve language pairs, by up to +18.5 BLEU points, an 840% relative improvement.
- Anthology ID:
- 2022.coling-1.388
- Volume:
- Proceedings of the 29th International Conference on Computational Linguistics
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Editors:
- Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
- Venue:
- COLING
- Publisher:
- International Committee on Computational Linguistics
- Pages:
- 4407–4417
- URL:
- https://aclanthology.org/2022.coling-1.388
- Cite (ACL):
- Mengjiao Zhang and Jia Xu. 2022. Byte-based Multilingual NMT for Endangered Languages. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4407–4417, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Cite (Informal):
- Byte-based Multilingual NMT for Endangered Languages (Zhang & Xu, COLING 2022)
- PDF:
- https://preview.aclanthology.org/proper-vol2-ingestion/2022.coling-1.388.pdf
- Code:
- mengjiaozhang/byte-based-multilingual-nmt
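
The abstract's claim that byte tokenization sidesteps OOV issues can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `byte_encode` and `byte_decode` are hypothetical, the sample strings are arbitrary, and the sketch assumes plain UTF-8 encoding with no special tokens. It only shows that any input text maps into a fixed vocabulary of 256 byte values, so no symbol can fall outside the vocabulary.

```python
from typing import List

VOCAB_SIZE = 256  # every UTF-8 byte value fits in 0..255


def byte_encode(text: str) -> List[int]:
    """Map a string to a sequence of UTF-8 byte IDs (hypothetical helper)."""
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    """Invert byte_encode; invalid byte sequences are replaced, not raised."""
    return bytes(ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Arbitrary sample strings in different scripts (illustrative only).
    for sample in ["hello", "naïve", "ᏣᎳᎩ"]:
        ids = byte_encode(sample)
        # Every ID is inside the fixed vocabulary, so nothing is OOV.
        assert all(0 <= i < VOCAB_SIZE for i in ids)
        # The mapping is lossless for valid text.
        assert byte_decode(ids) == sample
        print(sample, len(sample), "chars ->", len(ids), "bytes")
```

The trade-off visible here is sequence length: characters outside ASCII expand to several bytes each, so byte-level inputs are longer than their character or subword counterparts.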