Measuring the Impact of Data Augmentation Methods for Extremely Low-Resource NMT

Annie Lamar; Zeyneb Kaya

doi:10.18653/v1/2023.loresmt-1.8

Abstract

Data augmentation (DA) is a popular strategy to boost performance on neural machine translation tasks. The impact of data augmentation in low-resource environments, particularly for diverse and scarce languages, is understudied. In this paper, we introduce a simple yet novel metric to measure the impact of several different data augmentation strategies. This metric, which we call Data Augmentation Advantage (DAA), quantifies how many true data pairs a synthetic data pair is worth in a particular experimental context. We demonstrate the utility of this metric by training models for several linguistically varied datasets using the data augmentation methods of back-translation, SwitchOut, and sentence concatenation. In lower-resource tasks, DAA is an especially valuable metric for comparing DA performance, as it provides a more effective way to quantify gains when BLEU scores are especially small and results across diverse languages are more divergent and difficult to assess.

Anthology ID: 2023.loresmt-1.8
Volume: Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Atul Kr. Ojha, Chao-hong Liu, Ekaterina Vylomova, Flammie Pirinen, Jade Abbott, Jonathan Washington, Nathaniel Oco, Valentin Malykh, Varvara Logacheva, Xiaobing Zhao
Venue: LoResMT
Publisher: Association for Computational Linguistics
Pages: 101–109
URL: https://preview.aclanthology.org/fix-sig-urls/2023.loresmt-1.8/
DOI: 10.18653/v1/2023.loresmt-1.8
Cite (ACL): Annie Lamar and Zeyneb Kaya. 2023. Measuring the Impact of Data Augmentation Methods for Extremely Low-Resource NMT. In Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023), pages 101–109, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): Measuring the Impact of Data Augmentation Methods for Extremely Low-Resource NMT (Lamar & Kaya, LoResMT 2023)
PDF: https://preview.aclanthology.org/fix-sig-urls/2023.loresmt-1.8.pdf
Video: https://preview.aclanthology.org/fix-sig-urls/2023.loresmt-1.8.mp4
