Improving Jejueo-Korean Translation With Cross-Lingual Pretraining Using Japanese and Korean

Francis Zheng, Edison Marrese-Taylor, Yutaka Matsuo


Abstract
Jejueo is a critically endangered language spoken on Jeju Island and is closely related to but mutually unintelligible with Korean. Parallel data between Jejueo and Korean is scarce, and translation between the two languages requires more attention, as current neural machine translation systems typically rely on large amounts of parallel training data. While low-resource machine translation has been shown to benefit from using additional monolingual data during the pretraining process, not as much research has been done on how to select languages other than the source and target languages for use during pretraining. We show that using large amounts of Korean and Japanese data during the pretraining process improves translation by 2.16 BLEU points for translation in the Jejueo → Korean direction and 1.34 BLEU points for translation in the Korean → Jejueo direction compared to the baseline.
Anthology ID:
2022.wat-1.3
Original:
2022.wat-1.3v1
Version 2:
2022.wat-1.3v2
Volume:
Proceedings of the 9th Workshop on Asian Translation
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
WAT
Publisher:
International Conference on Computational Linguistics
Pages:
44–50
URL:
https://aclanthology.org/2022.wat-1.3
Cite (ACL):
Francis Zheng, Edison Marrese-Taylor, and Yutaka Matsuo. 2022. Improving Jejueo-Korean Translation With Cross-Lingual Pretraining Using Japanese and Korean. In Proceedings of the 9th Workshop on Asian Translation, pages 44–50, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
Cite (Informal):
Improving Jejueo-Korean Translation With Cross-Lingual Pretraining Using Japanese and Korean (Zheng et al., WAT 2022)
PDF:
https://preview.aclanthology.org/ingest-acl-2023-videos/2022.wat-1.3.pdf
Data
JIT Dataset