Elahe Kalbassi
2023
Small Data, Big Impact: Leveraging Minimal Data for Effective Machine Translation
Jean Maillard | Cynthia Gao | Elahe Kalbassi | Kaushik Ram Sadagopan | Vedanuj Goswami | Philipp Koehn | Angela Fan | Francisco Guzman
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
For many languages, machine translation progress is hindered by the lack of reliable training data. Models are trained on whatever pre-existing datasets may be available and then augmented with synthetic data, because it is often not economical to pay for the creation of large-scale datasets. But for the case of low-resource languages, would the creation of a few thousand professionally translated sentence pairs give any benefit? In this paper, we show that it does. We describe a broad data collection effort involving around 6k professionally translated sentence pairs for each of 39 low-resource languages, which we make publicly available. We analyse the gains of models trained on this small but high-quality data, showing that it has significant impact even when larger but lower quality pre-existing corpora are used, or when data is augmented with millions of sentences through backtranslation.