Low-Resource Neural Machine Translation: A Case Study of Cantonese

Evelyn Kai-Yan Liu


Abstract
The development of Natural Language Processing (NLP) applications for Cantonese, a language with over 85 million speakers, lags behind that of other languages with similar speaker populations. In this paper, we present, to the best of our knowledge, the first benchmark of multiple neural machine translation (NMT) systems from Mandarin Chinese to Cantonese. Additionally, we performed parallel sentence mining (PSM) as data augmentation for this extremely low-resource language pair, increasing the number of sentence pairs from 1,002 to 35,877. Results show that with PSM, the best-performing model (a BPE-level bidirectional LSTM) scored 11.98 BLEU points higher than the vanilla baseline and 9.93 BLEU points higher than our strong baseline. Our unsupervised NMT (UNMT) results also refute the previous assumption (Rubino et al., 2020) that poor performance stems from a lack of linguistic similarity between the source and target languages, particularly in the case of Mandarin and Cantonese. In the process of building the NMT systems, we also created the first large-scale parallel training and evaluation datasets for this language pair. Code and datasets are publicly available at https://github.com/evelynkyl/yue_nmt.
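The abstract does not specify how the parallel sentence mining step works; the paper itself should be consulted for the actual method. As a hypothetical illustration only, PSM over closely related varieties sharing a script (such as Mandarin and Cantonese) can be sketched as scoring candidate sentence pairs by surface similarity, here with cosine similarity over character-bigram counts and a similarity threshold. All function names and the threshold value below are illustrative assumptions, not the authors' implementation:

```python
from collections import Counter
from math import sqrt

def char_bigrams(sentence):
    """Character-bigram counts; suits Chinese text, which has no word spaces."""
    return Counter(sentence[i:i + 2] for i in range(len(sentence) - 1))

def cosine(c1, c2):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def mine_pairs(src_sents, tgt_sents, threshold=0.5):
    """Greedily pair each source sentence with its best-scoring target,
    keeping only pairs above the similarity threshold (an assumed value)."""
    pairs = []
    for s in src_sents:
        best = max(tgt_sents, key=lambda t: cosine(char_bigrams(s), char_bigrams(t)))
        score = cosine(char_bigrams(s), char_bigrams(best))
        if score >= threshold:
            pairs.append((s, best, score))
    return pairs
```

Real PSM pipelines typically replace the surface-similarity score with multilingual sentence embeddings, but the select-by-similarity-threshold structure is the same.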
Anthology ID:
2022.vardial-1.4
Volume:
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
28–40
URL:
https://aclanthology.org/2022.vardial-1.4
Cite (ACL):
Evelyn Kai-Yan Liu. 2022. Low-Resource Neural Machine Translation: A Case Study of Cantonese. In Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 28–40, Gyeongju, Republic of Korea. Association for Computational Linguistics.
Cite (Informal):
Low-Resource Neural Machine Translation: A Case Study of Cantonese (Liu, VarDial 2022)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2022.vardial-1.4.pdf
Code:
evelynkyl/yue_nmt