CantonMT: Cantonese to English NMT Platform with Fine-Tuned Models using Real and Synthetic Back-Translation Data

Kung Hong; Lifeng Han; Riza Theresa Batista-Navarro; Goran Nenadic

Abstract

Neural Machine Translation (NMT) for low-resource languages remains a challenge for many NLP researchers. In this work, we apply a standard data augmentation method, back-translation, to a new translation direction, i.e., Cantonese-to-English. We present the models we fine-tuned using the limited amount of real data and the synthetic data we generated using back-translation by three models: OpusMT, NLLB, and mBART. We carried out automatic evaluation using a range of metrics, including lexical-based and embedding-based ones. Furthermore, we created a user-friendly interface, CantonMT, for the models included in this project, and we make it available to facilitate Cantonese-to-English MT research. Researchers can add more models to this platform via our open-source CantonMT toolkit, available at https://github.com/kenrickkung/CantoneseTranslation.

Anthology ID: 2024.eamt-1.49
Volume: Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
Month: June
Year: 2024
Address: Sheffield, UK
Editors: Carolina Scarton, Charlotte Prescott, Chris Bayliss, Chris Oakley, Joanna Wright, Stuart Wrigley, Xingyi Song, Edward Gow-Smith, Rachel Bawden, Víctor M Sánchez-Cartagena, Patrick Cadwell, Ekaterina Lapshinova-Koltunski, Vera Cabarrão, Konstantinos Chatzitheodorou, Mary Nurminen, Diptesh Kanojia, Helena Moniz
Venue: EAMT
Publisher: European Association for Machine Translation (EAMT)
Pages: 590–599
URL: https://preview.aclanthology.org/Author-page-Marten-During-lu/2024.eamt-1.49/
Cite (ACL): Kung Hong, Lifeng Han, Riza Batista-Navarro, and Goran Nenadic. 2024. CantonMT: Cantonese to English NMT Platform with Fine-Tuned Models using Real and Synthetic Back-Translation Data. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), pages 590–599, Sheffield, UK. European Association for Machine Translation (EAMT).
Cite (Informal): CantonMT: Cantonese to English NMT Platform with Fine-Tuned Models using Real and Synthetic Back-Translation Data (Hong et al., EAMT 2024)
PDF: https://preview.aclanthology.org/Author-page-Marten-During-lu/2024.eamt-1.49.pdf
