Francis Zheng


2024

Improving Low-Resource Machine Translation for Formosan Languages Using Bilingual Lexical Resources
Francis Zheng | Edison Marrese-Taylor | Yutaka Matsuo
Findings of the Association for Computational Linguistics: ACL 2024

This paper investigates how machine translation for low-resource languages can be improved by incorporating information from bilingual lexicons during training, focusing mainly on translation between Mandarin and Formosan languages, which are all moribund or critically endangered; we also show that our techniques work for translation between Spanish and Nahuatl, a language pair from completely different language families. About 70% of the world's approximately 7,000 languages have data in the form of lexicons, a valuable resource for improving low-resource translation. We collect a dataset of parallel data and bilingual lexicons between Mandarin and 16 different Formosan languages and examine three approaches: (1) simply using the lexical data as additional parallel data, (2) generating pseudo-parallel sentence data for training by replacing words in the original parallel sentences using the lexicon, and (3) a combination of (1) and (2). All three approaches yield gains in both BLEU and chrF scores; (3) provided the most gains, followed by (1) and then (2), for both Mandarin-Formosan and Spanish-Nahuatl translation. With technique (3), we saw an average increase of 5.55 in BLEU scores and 10.33 in chrF scores.
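The pseudo-parallel generation in approach (2) can be illustrated with a minimal sketch. This is not the paper's implementation: the whitespace tokenization, the one-translation-per-word lexicon format, and the swap-one-aligned-word policy are all assumptions made for illustration.

```python
import random

def augment_with_lexicon(pairs, lexicon, seed=0):
    """Create pseudo-parallel pairs by swapping lexicon-aligned words.

    pairs:   list of (src_sentence, tgt_sentence) strings, assumed to be
             whitespace-tokenizable (Mandarin text would need segmentation).
    lexicon: dict mapping a source-language word to one target-language
             translation (a simplification; real lexicons list many senses).
    """
    rng = random.Random(seed)
    entries = list(lexicon.items())
    pseudo = []
    for src, tgt in pairs:
        src_toks, tgt_toks = src.split(), tgt.split()
        for i, s_tok in enumerate(src_toks):
            t_tok = lexicon.get(s_tok)
            if t_tok is None or t_tok not in tgt_toks:
                continue  # only touch words we can align through the lexicon
            # Swap the aligned (source, target) word pair for a different
            # lexicon entry, yielding a new synthetic but consistent pair.
            new_s, new_t = rng.choice(entries)
            j = tgt_toks.index(t_tok)
            new_src = src_toks[:i] + [new_s] + src_toks[i + 1:]
            new_tgt = tgt_toks[:j] + [new_t] + tgt_toks[j + 1:]
            pseudo.append((" ".join(new_src), " ".join(new_tgt)))
    return pseudo
```

Under these assumptions, approach (1) would simply append the lexicon entries themselves as extra sentence pairs, and approach (3) would concatenate both kinds of augmented data with the original training corpus.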

2022

Improving Jejueo-Korean Translation With Cross-Lingual Pretraining Using Japanese and Korean
Francis Zheng | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the 9th Workshop on Asian Translation

Jejueo is a critically endangered language spoken on Jeju Island and is closely related to but mutually unintelligible with Korean. Parallel data between Jejueo and Korean is scarce, and translation between the two languages requires more attention, as current neural machine translation systems typically rely on large amounts of parallel training data. While low-resource machine translation has been shown to benefit from using additional monolingual data during pretraining, less research has examined how to select languages other than the source and target languages for use during pretraining. We show that using large amounts of Korean and Japanese data during pretraining improves translation by 2.16 BLEU points in the Jejueo → Korean direction and 1.34 BLEU points in the Korean → Jejueo direction compared to the baseline.
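A minimal sketch of how extra monolingual languages can enter a denoising pretraining stream is shown below. It is an illustration only: word-level masking stands in for mBART-style subword span infilling, and the language tags and masking ratio are assumed values rather than those used in the paper.

```python
import random

MASK = "<mask>"

def make_denoising_example(sentence, lang_tag, mask_ratio=0.35, seed=0):
    """Turn one monolingual sentence into a (noisy input, clean target) pair."""
    rng = random.Random(seed)
    toks = sentence.split()
    n_mask = max(1, int(len(toks) * mask_ratio))
    masked = set(rng.sample(range(len(toks)), n_mask))
    noisy = [MASK if i in masked else tok for i, tok in enumerate(toks)]
    return f"{lang_tag} " + " ".join(noisy), f"{lang_tag} " + sentence

# Monolingual Korean and Japanese text is mixed into a single pretraining
# stream; Jejueo-Korean parallel data is only used later, during finetuning.
korean_mono = ["나는 밥을 먹었다"]
japanese_mono = ["私 は ご飯 を 食べた"]  # pre-segmented for this toy example
pretraining_stream = (
    [make_denoising_example(s, "<ko>") for s in korean_mono]
    + [make_denoising_example(s, "<ja>") for s in japanese_mono]
)
```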

A Parallel Corpus and Dictionary for Amis-Mandarin Translation
Francis Zheng | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

Amis is an endangered language indigenous to Taiwan with limited data available for computational processing. We thus present an Amis-Mandarin dataset containing a parallel corpus of 5,751 Amis and Mandarin sentences and a dictionary of 7,800 Amis words and phrases with their definitions in Mandarin. Using our dataset, we also establish a baseline for machine translation between Amis and Mandarin in both directions. Our dataset can be found at https://github.com/francisdzheng/amis-mandarin.
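A minimal loading sketch for a corpus of this shape is given below. The file names and formats (line-aligned text files and a tab-separated dictionary) are assumptions for illustration; consult the repository for the actual layout.

```python
from pathlib import Path

def load_parallel(amis_path, mandarin_path):
    """Read a sentence-aligned parallel corpus from two line-aligned files."""
    amis = Path(amis_path).read_text(encoding="utf-8").splitlines()
    mandarin = Path(mandarin_path).read_text(encoding="utf-8").splitlines()
    assert len(amis) == len(mandarin), "corpus sides must be line-aligned"
    return list(zip(amis, mandarin))

def load_dictionary(tsv_path):
    """Read a headword<TAB>definition dictionary into a dict."""
    entries = {}
    for line in Path(tsv_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        headword, definition = line.split("\t", 1)
        entries[headword] = definition
    return entries

# Hypothetical file names; the repository defines the real ones.
corpus = load_parallel("train.ami", "train.zh")
dictionary = load_dictionary("amis_mandarin_dict.tsv")
```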

2021

Low-Resource Machine Translation Using Cross-Lingual Language Model Pretraining
Francis Zheng | Machel Reid | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper describes UTokyo’s submission to the AmericasNLP 2021 Shared Task on machine translation systems for indigenous languages of the Americas. We present a low-resource machine translation system that improves translation accuracy using cross-lingual language model pretraining. Our system uses the mBART implementation in fairseq to pretrain on a large set of monolingual data from a diverse set of high-resource languages before finetuning on 10 low-resource indigenous American languages: Aymara, Bribri, Asháninka, Guaraní, Wixarika, Náhuatl, Hñähñu, Quechua, Shipibo-Konibo, and Rarámuri. On average, our system achieved BLEU scores that were 1.64 higher and chrF scores that were 0.0749 higher than the baseline.
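All of the papers above report BLEU and chrF. As a reference point, a minimal scoring sketch using the sacrebleu library is shown below; sacrebleu is an assumption here, since the abstracts do not name the evaluation toolkit, and recent sacrebleu versions report chrF on a 0-100 scale, whereas the 0.0749 figure above is on a 0-1 scale.

```python
import sacrebleu

# Toy system outputs and references; the real ones come from the shared-task
# test sets for each language pair.
hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}, chrF = {chrf.score:.2f}")
```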