AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages

Machel Reid, Junjie Hu, Graham Neubig, Yutaka Matsuo


Abstract
Reproducible benchmarks are crucial for driving progress in machine translation research. However, existing machine translation benchmarks have been mostly limited to high-resource or well-represented languages. Despite an increasing interest in low-resource machine translation, there are no standardized, reproducible benchmarks for many African languages, many of which are spoken by millions of people but have little digitized textual data. To tackle these challenges, we propose AfroMT, a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages. We also develop a suite of analysis tools for system diagnosis that takes into account the unique properties of these languages. Furthermore, we explore the newly considered case of low-resource-focused pretraining and develop two novel data-augmentation-based strategies, leveraging word-level alignment information and pseudo-monolingual data for pretraining multilingual sequence-to-sequence models. We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines. We also show gains of up to 12 BLEU points over cross-lingual transfer baselines in data-constrained scenarios. All code and pretrained models will be released as further steps towards larger reproducible benchmarks for African languages.
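The abstract names an alignment-based augmentation strategy but does not spell out its mechanics. Below is a minimal, hypothetical Python sketch of one plausible form of such augmentation: swapping source tokens for word-aligned target-language tokens to build code-switched inputs for denoising sequence-to-sequence pretraining. The function name, alignment format, swap probability, and example sentence pair are all illustrative assumptions, not the authors' released implementation (see machelreid/afromt for that).

import random

def code_switch(src_tokens, tgt_tokens, alignments, swap_prob=0.3, seed=0):
    """Replace aligned source tokens with their target-language counterparts.

    alignments: list of (src_idx, tgt_idx) word-alignment pairs, e.g. produced
    by an external word aligner (an assumption; the paper's exact pipeline may
    differ). The resulting code-switched sentence can serve as a noised input
    for denoising sequence-to-sequence pretraining.
    """
    rng = random.Random(seed)
    out = list(src_tokens)
    for s, t in alignments:
        if rng.random() < swap_prob:
            out[s] = tgt_tokens[t]
    return out

# Toy usage with an illustrative English-Swahili pair and hand-made alignments.
src = "the cat sat on the mat".split()
tgt = "paka aliketi juu ya mkeka".split()
align = [(1, 0), (2, 1), (5, 4)]
print(" ".join(code_switch(src, tgt, align, swap_prob=1.0)))
# -> the paka aliketi on the mkeka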
Anthology ID:
2021.emnlp-main.99
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1306–1320
URL:
https://aclanthology.org/2021.emnlp-main.99
DOI:
10.18653/v1/2021.emnlp-main.99
Cite (ACL):
Machel Reid, Junjie Hu, Graham Neubig, and Yutaka Matsuo. 2021. AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1306–1320, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages (Reid et al., EMNLP 2021)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2021.emnlp-main.99.pdf
Video:
https://preview.aclanthology.org/auto-file-uploads/2021.emnlp-main.99.mp4
Code:
machelreid/afromt
Data:
OpenSubtitles