Complete Multilingual Neural Machine Translation

Markus Freitag, Orhan Firat


Abstract
Multilingual Neural Machine Translation (MNMT) models are commonly trained on a joint set of bilingual corpora which is acutely English-centric (i.e., English is either the source or the target language). While direct data between two non-English languages is explicitly available at times, its use is not common. In this paper, we first take a step back and look at the commonly used bilingual corpora (WMT), and resurface the existence and importance of an implicit structure that exists in them: multi-way alignment across examples (the same sentence in more than two languages). We set out to study the use of multi-way aligned examples in order to enrich the original English-centric parallel corpora. We reintroduce this direct parallel data from multi-way aligned corpora between all source and target languages. By doing so, the English-centric graph expands into a complete graph, with every language pair connected. We call MNMT with such a connectivity pattern complete Multilingual Neural Machine Translation (cMNMT) and demonstrate its utility and efficacy with a series of experiments and analyses. In combination with a novel training data sampling strategy that is conditioned on the target language only, cMNMT yields competitive translation quality for all language pairs. We further study the size effect of multi-way aligned data, its transfer learning capabilities, and how it eases adding a new language in MNMT. Finally, we stress test cMNMT at scale and demonstrate that we can train a cMNMT model with up to 12,432 language pairs that provides competitive translation quality for all language pairs.
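The expansion from an English-centric graph to a complete graph can be illustrated with a minimal sketch. It assumes a multi-way aligned example is represented as a mapping from language code to sentence. The function names `expand_multiway` and `num_directed_pairs` and the sample sentences are hypothetical and are not taken from the paper, and the sketch does not cover the target-conditioned sampling strategy. The pair count n(n-1) for a complete directed graph equals 12,432 at n = 112, a language count inferred here from the arithmetic rather than stated in the abstract.

```python
from itertools import permutations


def expand_multiway(example):
    """Turn one multi-way aligned example into all directed translation pairs.

    `example` maps a language code to the sentence in that language
    (an assumed representation, chosen for illustration only).
    Returns a list of (src_lang, tgt_lang, src_sentence, tgt_sentence).
    """
    langs = sorted(example)  # fixed order so the output is deterministic
    return [(s, t, example[s], example[t]) for s, t in permutations(langs, 2)]


def num_directed_pairs(num_languages):
    """Number of directed language pairs in a complete graph over n languages."""
    return num_languages * (num_languages - 1)


# A toy three-way aligned example (illustrative sentences, not corpus data).
example = {"en": "Good morning.", "de": "Guten Morgen.", "fr": "Bonjour."}
pairs = expand_multiway(example)

# English-centric training would use only the pairs that involve "en";
# the complete graph also includes the direct de<->fr pairs.
english_centric = [p for p in pairs if "en" in (p[0], p[1])]
direct_non_english = [p for p in pairs if "en" not in (p[0], p[1])]
```

In this toy case the three-way example yields six directed pairs, four of which involve English and two of which (German to French and French to German) are the direct data that an English-centric setup would leave unused.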
Anthology ID:
2020.wmt-1.66
Volume:
Proceedings of the Fifth Conference on Machine Translation
Month:
November
Year:
2020
Address:
Online
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
550–560
URL:
https://aclanthology.org/2020.wmt-1.66
Cite (ACL):
Markus Freitag and Orhan Firat. 2020. Complete Multilingual Neural Machine Translation. In Proceedings of the Fifth Conference on Machine Translation, pages 550–560, Online. Association for Computational Linguistics.
Cite (Informal):
Complete Multilingual Neural Machine Translation (Freitag & Firat, WMT 2020)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2020.wmt-1.66.pdf
Video:
https://slideslive.com/38939550