Complete Multilingual Neural Machine Translation

Markus Freitag; Orhan Firat

Abstract

Multilingual Neural Machine Translation (MNMT) models are commonly trained on a joint set of bilingual corpora that is acutely English-centric (i.e., English is either the source or the target language). While direct data between two non-English languages is explicitly available at times, its use is not common. In this paper, we first take a step back and look at the commonly used bilingual corpora (WMT), and resurface the existence and importance of an implicit structure that exists in them: multi-way alignment across examples (the same sentence in more than two languages). We set out to study the use of multi-way aligned examples in order to enrich the original English-centric parallel corpora. We reintroduce this direct parallel data from multi-way aligned corpora between all source and target languages. By doing so, the English-centric graph expands into a complete graph, with every language pair connected. We call MNMT with such a connectivity pattern complete Multilingual Neural Machine Translation (cMNMT) and demonstrate its utility and efficacy with a series of experiments and analyses. In combination with a novel training data sampling strategy that is conditioned on the target language only, cMNMT yields competitive translation quality for all language pairs. We further study the size effect of multi-way aligned data, its transfer learning capabilities, and how it eases adding a new language to MNMT. Finally, we stress test cMNMT at scale and demonstrate that we can train a cMNMT model with up to 12,432 language pairs that provides competitive translation quality for all language pairs.
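The step from an English-centric graph to a complete graph can be illustrated with a minimal sketch. It is not the authors' data pipeline: the function names, the language codes, and the toy sentences are all hypothetical. The sketch counts directed language pairs under both connectivity patterns and expands one multi-way aligned example into every direct source-target pair. As a side note, 12,432 equals 112 × 111, which would match a complete directed graph over 112 languages; the abstract does not state the number of languages, so that reading is an inference.

```python
from itertools import permutations


def directed_pairs(languages):
    """All ordered (source, target) pairs in a complete graph over the languages."""
    return list(permutations(languages, 2))


def english_centric_pairs(languages, pivot="en"):
    """Ordered pairs in which the pivot language is the source or the target."""
    return [(s, t) for s, t in permutations(languages, 2) if pivot in (s, t)]


def expand_multiway(example):
    """Expand one multi-way aligned example {lang: sentence} into direct pairs."""
    return [
        {"src_lang": s, "tgt_lang": t, "src": example[s], "tgt": example[t]}
        for s, t in permutations(example, 2)
    ]


# Hypothetical toy data, chosen only for illustration.
langs = ["en", "de", "fr", "cs"]
example = {"en": "Hello.", "de": "Hallo.", "fr": "Bonjour."}

print(len(english_centric_pairs(langs)))  # pairs that involve English
print(len(directed_pairs(langs)))         # pairs in the complete graph
for pair in expand_multiway(example):
    print(pair["src_lang"], "->", pair["tgt_lang"])
```

With n languages, the English-centric graph has 2(n - 1) directed pairs while the complete graph has n(n - 1), so the gap between the two widens quickly as languages are added.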

Anthology ID: 2020.wmt-1.66
Volume: Proceedings of the Fifth Conference on Machine Translation
Month: November
Year: 2020
Address: Online
Venues: EMNLP | WMT
SIG: SIGMT
Publisher: Association for Computational Linguistics
Pages: 550–560
URL: https://aclanthology.org/2020.wmt-1.66
Cite (ACL): Markus Freitag and Orhan Firat. 2020. Complete Multilingual Neural Machine Translation. In Proceedings of the Fifth Conference on Machine Translation, pages 550–560, Online. Association for Computational Linguistics.
Cite (Informal): Complete Multilingual Neural Machine Translation (Freitag & Firat, WMT 2020)
PDF: https://preview.aclanthology.org/update-css-js/2020.wmt-1.66.pdf
Video: https://slideslive.com/38939550
