Abstract
Multilingual neural machine translation aims to encapsulate multiple languages in a single model. However, it requires enormous datasets, leaving low-resource languages (LRLs) underdeveloped. As LRLs may benefit from the shared knowledge of a multilingual representation, we aspire to find effective ways to integrate unseen languages into a pre-trained model. Nevertheless, the intricacy of the shared representation among languages hinders its full utilisation. To resolve this problem, we employed target language prediction and a central language-aware layer to improve the representation when integrating LRLs. Focusing on improving LRLs in the linguistically diverse country of Indonesia, we evaluated five languages using parallel corpora of 1,000 instances each, with experimental results measured in BLEU showing a zero-shot improvement of 7.4 over the baseline score of 7.1, reaching a score of 15.5 at best. Further analysis showed that the gains in performance are attributed more to the disentanglement of the multilingual representation in the encoder than to the shift of the target language-specific representation in the decoder. (A rough sketch of the target language prediction objective follows the bibliographic details below.)
- Anthology ID:
- 2024.lrec-main.446
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- Publisher:
- ELRA and ICCL
- Pages:
- 4978–4989
- URL:
- https://aclanthology.org/2024.lrec-main.446
- Cite (ACL):
- Frederikus Hudi, Zhi Qu, Hidetaka Kamigaito, and Taro Watanabe. 2024. Disentangling Pretrained Representation to Leverage Low-Resource Languages in Multilingual Machine Translation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4978–4989, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- Disentangling Pretrained Representation to Leverage Low-Resource Languages in Multilingual Machine Translation (Hudi et al., LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/naacl-24-ws-corrections/2024.lrec-main.446.pdf
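As a rough illustration of the target language prediction component named in the abstract, the sketch below adds an auxiliary classifier over pooled encoder states that predicts the intended target language and is trained jointly with the usual translation loss. This is a minimal sketch under assumed conventions, not the authors' implementation: the class name `TargetLanguagePredictor`, the mean-pooling strategy, and the weighting term `tlp_weight` are all illustrative assumptions.

```python
# Hypothetical sketch of a target language prediction (TLP) auxiliary head.
# All names and design choices here are illustrative assumptions; see the
# paper for the authors' actual formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TargetLanguagePredictor(nn.Module):
    """Auxiliary head: predict the target language from pooled encoder states."""

    def __init__(self, d_model: int, num_languages: int):
        super().__init__()
        self.classifier = nn.Linear(d_model, num_languages)

    def forward(self, encoder_states: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, seq_len, d_model); pad_mask: (batch, seq_len), 1 for real tokens.
        mask = pad_mask.unsqueeze(-1).to(encoder_states.dtype)
        # Mean-pool over non-padding positions (pooling choice is an assumption).
        pooled = (encoder_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)  # (batch, num_languages) logits


def joint_loss(translation_loss: torch.Tensor,
               tlp_logits: torch.Tensor,
               target_lang_ids: torch.Tensor,
               tlp_weight: float = 0.1) -> torch.Tensor:
    # Standard NMT cross-entropy plus the weighted auxiliary TLP objective;
    # tlp_weight is a hypothetical hyperparameter, not a value from the paper.
    return translation_loss + tlp_weight * F.cross_entropy(tlp_logits, target_lang_ids)
```

Intuitively, asking the encoder to reveal which language the decoder should produce pushes source-side representations to separate by intended target, which is the kind of encoder-side disentanglement the abstract credits for the gains.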