Elahe Kalbassi


2023

pdf
Small Data, Big Impact: Leveraging Minimal Data for Effective Machine Translation
Jean Maillard | Cynthia Gao | Elahe Kalbassi | Kaushik Ram Sadagopan | Vedanuj Goswami | Philipp Koehn | Angela Fan | Francisco Guzman
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

For many languages, machine translation progress is hindered by the lack of reliable training data. Models are trained on whatever pre-existing datasets may be available and then augmented with synthetic data, because it is often not economical to pay for the creation of large-scale datasets. But for the case of low-resource languages, would the creation of a few thousand professionally translated sentence pairs give any benefit? In this paper, we show that it does.We describe a broad data collection effort involving around 6k professionally translated sentence pairs for each of 39 low-resource languages, which we make publicly available. We analyse the gains of models trained on this small but high-quality data, showing that it has significant impact even when larger but lower quality pre-existing corpora are used, or when data is augmented with millions of sentences through backtranslation.