Morphologically-Guided Segmentation For Translation of Agglutinative Low-Resource Languages

William Chen, Brett Fazio


Abstract
Neural Machine Translation (NMT) for Low-Resource Languages (LRLs) is often limited by the lack of available training data, making it necessary to explore additional techniques to improve translation quality. We propose the use of the Prefix-Root-Postfix-Encoding (PRPE) subword segmentation algorithm to improve translation quality for LRLs, using two agglutinative languages as case studies: Quechua and Indonesian. During the course of our experiments, we reintroduce a parallel corpus for Quechua-Spanish translation that was previously unavailable for NMT. Our experiments show the importance of appropriate subword segmentation, which can go as far as improving translation quality over systems trained on much larger quantities of data. We show this by achieving state-of-the-art results for both languages, obtaining higher BLEU scores than large pre-trained models while using much smaller amounts of data.
Anthology ID:
2021.mtsummit-loresmt.3
Volume:
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)
Month:
August
Year:
2021
Address:
Virtual
Venue:
LoResMT
Publisher:
Association for Machine Translation in the Americas
Pages:
20–31
URL:
https://aclanthology.org/2021.mtsummit-loresmt.3
Cite (ACL):
William Chen and Brett Fazio. 2021. Morphologically-Guided Segmentation For Translation of Agglutinative Low-Resource Languages. In Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021), pages 20–31, Virtual. Association for Machine Translation in the Americas.
Cite (Informal):
Morphologically-Guided Segmentation For Translation of Agglutinative Low-Resource Languages (Chen & Fazio, LoResMT 2021)
PDF:
https://preview.aclanthology.org/remove-xml-comments/2021.mtsummit-loresmt.3.pdf
Code:
wanchichen/morphological-nmt
Data:
JW300