Auxiliary Subword Segmentations as Related Languages for Low Resource Multilingual Translation

Nishant Kambhatla, Logan Born, Anoop Sarkar


Abstract
We propose a novel technique that combines alternative subword tokenizations of a single source-target language pair, which allows us to leverage multilingual neural translation training methods. These alternative segmentations function like related languages in multilingual translation. Overall, this improves translation accuracy for low-resource languages and produces translations that are lexically diverse and morphologically rich. We also introduce a cross-teaching technique that yields further improvements in translation accuracy and cross-lingual transfer between high- and low-resource language pairs. Compared to other strong multilingual baselines, our approach yields average gains of +1.7 BLEU across the four low-resource datasets from the multilingual TED-talks dataset. Our technique does not require additional training data and is a drop-in improvement for any existing neural translation system.
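
The core idea in the abstract, treating alternative segmentations of one language pair as if they were related languages, can be illustrated with a minimal sketch. This is not the authors' implementation: the fixed-width toy segmenter, the `@@` continuation marker, the tag tokens `<seg_a>` and `<seg_b>`, and the function names are all assumptions made for illustration. A real system would presumably use learned subword models in place of `segment_word`.

```python
from typing import Dict, List, Tuple

Example = Tuple[List[str], List[str]]


def segment_word(word: str, size: int) -> List[str]:
    """Toy stand-in for a subword segmenter (an assumption, not a learned model).

    Splits a word into chunks of at most `size` characters and marks every
    non-final chunk with '@@' so the word can be reassembled later.
    """
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    return [p + "@@" for p in pieces[:-1]] + pieces[-1:]


def segment_sentence(sentence: str, size: int) -> List[str]:
    """Segment each whitespace-separated word of a sentence."""
    tokens: List[str] = []
    for word in sentence.split():
        tokens.extend(segment_word(word, size))
    return tokens


def build_multiview_corpus(
    pairs: List[Tuple[str, str]], views: Dict[str, int]
) -> List[Example]:
    """Emit one training example per (sentence pair, segmentation view).

    Each view is treated like a separate "language": the source side is
    prefixed with a hypothetical tag token naming the view, mirroring the
    language tags commonly used in multilingual translation training.
    """
    corpus: List[Example] = []
    for src, tgt in pairs:
        for tag, size in views.items():
            src_tokens = [tag] + segment_sentence(src, size)
            tgt_tokens = segment_sentence(tgt, size)
            corpus.append((src_tokens, tgt_tokens))
    return corpus


if __name__ == "__main__":
    pairs = [("hello world", "hallo welt")]
    views = {"<seg_a>": 3, "<seg_b>": 2}  # two hypothetical segmentation views
    for src_tokens, tgt_tokens in build_multiview_corpus(pairs, views):
        print(" ".join(src_tokens), "|||", " ".join(tgt_tokens))
```

Under these assumptions, the same sentence pair appears once per view, so the training set grows without any additional parallel data, which matches the abstract's claim that no extra training data is required.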
Anthology ID:
2022.eamt-1.16
Volume:
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
Month:
June
Year:
2022
Address:
Ghent, Belgium
Venue:
EAMT
Publisher:
European Association for Machine Translation
Pages:
131–140
URL:
https://aclanthology.org/2022.eamt-1.16
Cite (ACL):
Nishant Kambhatla, Logan Born, and Anoop Sarkar. 2022. Auxiliary Subword Segmentations as Related Languages for Low Resource Multilingual Translation. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 131–140, Ghent, Belgium. European Association for Machine Translation.
Cite (Informal):
Auxiliary Subword Segmentations as Related Languages for Low Resource Multilingual Translation (Kambhatla et al., EAMT 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.eamt-1.16.pdf