2022
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer | Isaac Caswell | Lisa Wang | Ahsan Wahab | Daan van Esch | Nasanbayar Ulzii-Orshikh | Allahsera Tapo | Nishant Subramani | Artem Sokolov | Claytone Sikasote | Monang Setyawan | Supheakmungkol Sarin | Sokhar Samb | Benoît Sagot | Clara Rivera | Annette Rios | Isabel Papadimitriou | Salomey Osei | Pedro Ortiz Suarez | Iroro Orife | Kelechi Ogueji | Andre Niyongabo Rubungo | Toan Q. Nguyen | Mathias Müller | André Müller | Shamsuddeen Hassan Muhammad | Nanda Muhammad | Ayanda Mnyakeni | Jamshidbek Mirzakhalov | Tapiwanashe Matangira | Colin Leong | Nze Lawson | Sneha Kudugunta | Yacine Jernite | Mathias Jenny | Orhan Firat | Bonaventure F. P. Dossou | Sakhile Dlamini | Nisansa de Silva | Sakine Çabuk Ballı | Stella Biderman | Alessia Battisti | Ahmed Baruwa | Ankur Bapna | Pallavi Baljekar | Israel Abebe Azime | Ayodele Awokoya | Duygu Ataman | Orevaoghene Ahia | Oghenefego Ahia | Sweta Agrawal | Mofetoluwa Adeyemi
Transactions of the Association for Computational Linguistics, Volume 10
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
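As a concrete illustration of the automatic analyses mentioned in the abstract, the following minimal Python sketch estimates how much of a Web-crawled corpus is actually in its declared language by running an off-the-shelf language-ID model over a random sample of sentences. This is not the paper's audit protocol, which relies on human raters and a quality taxonomy, and off-the-shelf language ID covers far fewer languages than these corpora claim to; the `langdetect` dependency, the sample size, and the `estimate_in_language_ratio` helper are assumptions for illustration only.

# Hedged sketch: approximate the share of sentences that language ID assigns
# to the corpus's declared language code. Not the paper's human audit.
import random
from langdetect import detect  # third-party package; an assumption for this sketch

def estimate_in_language_ratio(sentences, expected_code, sample_size=100, seed=0):
    """Fraction of a random, non-empty sample whose detected language matches `expected_code`."""
    rng = random.Random(seed)
    sample = rng.sample(sentences, min(sample_size, len(sentences)))
    hits = 0
    for sentence in sample:
        try:
            if detect(sentence) == expected_code:
                hits += 1
        except Exception:  # language ID fails on empty or non-linguistic strings
            pass
    return hits / len(sample)

# Example: flag a corpus for manual review if fewer than half of the sampled
# sentences match its declared code (here 'sw', Swahili, as an illustration).
# if estimate_in_language_ratio(lines, "sw") < 0.5: print("needs manual review")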
2021
MURAL: Multimodal, Multitask Representations Across Languages
Aashi Jain | Mandy Guo | Krishna Srinivasan | Ting Chen | Sneha Kudugunta | Chao Jia | Yinfei Yang | Jason Baldridge
Findings of the Association for Computational Linguistics: EMNLP 2021
Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al.), a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL’s performance matches or exceeds ALIGN’s cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8% on average when fine-tuning. We additionally show that MURAL’s text representations cluster not only with respect to genealogical connections but also based on areal linguistics, such as the Balkan Sprachbund.
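The two matching tasks can be made concrete with a short sketch. The PyTorch code below is an illustration only, not the authors' ALIGN-scale implementation: it applies the same in-batch softmax contrastive loss to image-caption pairs and to translation pairs, with a single shared text encoder. The encoder modules, loss weights, and temperature are assumptions.

# Hedged sketch of a dual encoder trained on two matching tasks.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(a, b, temperature=0.07):
    """Symmetric softmax loss: row i of `a` should match row i of `b` within the batch."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # [batch, batch] similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def mural_step(image_encoder, text_encoder, images, captions, src_sents, tgt_sents,
               w_i2t=1.0, w_t2t=1.0):
    """One multitask step: image-text matching plus translation-pair matching."""
    loss_i2t = in_batch_contrastive_loss(image_encoder(images), text_encoder(captions))
    loss_t2t = in_batch_contrastive_loss(text_encoder(src_sents), text_encoder(tgt_sents))
    return w_i2t * loss_i2t + w_t2t * loss_t2t

The key design point is that the text encoder is shared across both tasks, which is how translation pairs can compensate for scarce image-caption data in under-resourced languages.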
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
Sneha Kudugunta | Yanping Huang | Ankur Bapna | Maxim Krikun | Dmitry Lepikhin | Minh-Thang Luong | Orhan Firat
Findings of the Association for Computational Linguistics: EMNLP 2021
Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are prohibitively large and practitioners often resort to methods such as distillation for serving. In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation. Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models. On WMT, our task-MoE with 32 experts (533M parameters) outperforms the best performing token-level MoE model (token-MoE) by +1.0 BLEU on average across 30 language pairs. The peak inference throughput is also improved by a factor of 1.9x when we route by tasks instead of tokens. While distilling a token-MoE to a smaller dense model preserves only 32% of the BLEU gains, our sub-network task-MoE, by design, preserves all the gains with the same inference cost as the distilled student model. Finally, when scaling up to 200 language pairs, our 128-expert task-MoE (13B parameters) performs competitively with a token-level counterpart, while improving the peak inference throughput by a factor of 2.6x.
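The routing contrast can be sketched in a few lines of PyTorch. This is not the paper's implementation (which involves top-2 gating, capacity limits, and expert parallelism); it only illustrates the idea that a token-level router picks an expert per token, while a task-level router reuses one expert choice for every token of a sequence's task ID, so that expert's weights form a small, extractable sub-network. The layer sizes and naive top-1 dispatch loop are assumptions.

# Hedged sketch: toy MoE feed-forward layer with switchable routing granularity.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model, n_experts, n_tasks, route_by="task"):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.token_router = nn.Linear(d_model, n_experts)    # scores from token states
        self.task_router = nn.Embedding(n_tasks, n_experts)  # scores from a task ID
        self.route_by = route_by

    def forward(self, x, task_id):
        # x: [batch, seq, d_model]; task_id: [batch] ints (e.g., target-language ID).
        if self.route_by == "task":
            expert_idx = self.task_router(task_id).argmax(-1)        # one expert per sequence
            expert_idx = expert_idx.unsqueeze(1).expand(x.size(0), x.size(1))
        else:  # token-level routing
            expert_idx = self.token_router(x).argmax(-1)             # one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):                    # naive top-1 dispatch
            mask = (expert_idx == e).unsqueeze(-1).to(x.dtype)
            out = out + mask * expert(x)
        return out

# With route_by="task", serving a given task only requires loading that task's
# expert, i.e., a ready-to-deploy sub-network with dense-model inference cost.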
2020
Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation
Aditya Siddhant | Ankur Bapna | Yuan Cao | Orhan Firat | Mia Chen | Sneha Kudugunta | Naveen Arivazhagan | Yonghui Wu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Over the last few years two promising research directions in low-resource neural machine translation (NMT) have emerged. The first focuses on utilizing high-resource languages to improve the quality of low-resource languages via multilingual NMT. The second direction employs monolingual data with self-supervision to pre-train translation models, followed by fine-tuning on small amounts of supervised data. In this work, we join these two lines of research and demonstrate the efficacy of monolingual data with self-supervision in multilingual NMT. We offer three major results: (i) Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models. (ii) Self-supervision improves zero-shot translation quality in multilingual models. (iii) Leveraging monolingual data with self-supervision provides a viable path towards adding new languages to multilingual models, getting up to 33 BLEU on ro-en translation without any parallel data or back-translation.
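A minimal sketch of the data mixture described above (not the authors' exact recipe): a single multilingual seq2seq model alternates between supervised translation batches, tagged with the target language, and self-supervised batches built from monolingual text with a MASS-style span mask. The `model.loss` interface, the `<2xx>` language tags, the masking ratio, and the batch format are hypothetical, for illustration only.

# Hedged sketch: joint supervised + self-supervised objective for multilingual NMT.
import random

MASK = "<mask>"

def mask_span(tokens, ratio=0.5, rng=random):
    """MASS-style corruption of a non-empty token list: hide a contiguous span
    in the source; the decoder must reconstruct that span."""
    n = max(1, int(len(tokens) * ratio))
    start = rng.randrange(0, len(tokens) - n + 1)
    corrupted = tokens[:start] + [MASK] * n + tokens[start + n:]
    return corrupted, tokens[start:start + n]

def training_step(model, parallel_batch, mono_batch, p_selfsup=0.5, rng=random):
    """One step of the joint objective: translation or monolingual denoising."""
    if rng.random() < p_selfsup:
        lang, sentences = mono_batch                       # monolingual text in `lang`
        pairs = [mask_span(s.split()) for s in sentences]
        src = [f"<2{lang}> " + " ".join(c) for c, _ in pairs]
        tgt = [" ".join(t) for _, t in pairs]
    else:
        src_lang, tgt_lang, src_sents, tgt = parallel_batch
        src = [f"<2{tgt_lang}> " + s for s in src_sents]   # standard multilingual NMT tagging
    return model.loss(src, tgt)                            # hypothetical seq2seq loss interface

Because the self-supervised batches use the same model and tagging scheme as translation, monolingual data for a new language can be added without any parallel data, which is the path to the zero-resource ro-en result mentioned in the abstract.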
2019
Investigating Multilingual NMT Representations at Scale
Sneha Kudugunta | Ankur Bapna | Isaac Caswell | Orhan Firat
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Multilingual Neural Machine Translation (NMT) models have yielded large empirical success in transfer learning settings. However, these black-box representations are poorly understood, and their mode of transfer remains elusive. In this work, we attempt to understand massively multilingual NMT representations (with 103 languages) using Singular Value Canonical Correlation Analysis (SVCCA), a representation similarity framework that allows us to compare representations across different languages, layers and models. Our analysis validates several empirical results and long-standing intuitions, and unveils new observations regarding how representations evolve in a multilingual translation model. We draw three major results from our analysis, with implications on cross-lingual transfer learning: (i) Encoder representations of different languages cluster based on linguistic similarity, (ii) Representations of a source language learned by the encoder are dependent on the target language, and vice-versa, and (iii) Representations of high resource and/or linguistically similar languages are more robust when fine-tuning on an arbitrary language pair, which is critical to determining how much cross-lingual transfer can be expected in a zero or few-shot setting. We further connect our findings with existing empirical observations in multilingual NMT and transfer learning.
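For reference, SVCCA itself is simple to state: reduce each activation matrix to the singular directions that explain most of its variance, then compute canonical correlations between the two reduced views and average them. The NumPy sketch below follows that recipe, assuming activation matrices of shape [examples, neurons]; it is an illustration rather than the authors' implementation, and the 99% variance threshold is a common default, not necessarily the paper's setting.

# Hedged sketch of the SVCCA similarity between two sets of activations.
import numpy as np

def svcca_similarity(X, Y, keep_variance=0.99, eps=1e-10):
    """Mean SVCCA correlation between activation matrices X, Y of shape [examples, neurons]."""
    def svd_reduce(Z):
        Z = Z - Z.mean(axis=0, keepdims=True)                 # center each neuron
        U, s, _ = np.linalg.svd(Z, full_matrices=False)
        k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep_variance)) + 1
        return U[:, :k] * s[:k]                               # keep top singular directions

    def whiten(Z):
        cov = Z.T @ Z / (Z.shape[0] - 1)
        vals, vecs = np.linalg.eigh(cov)
        inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T
        return Z @ inv_sqrt

    Xr, Yr = svd_reduce(X), svd_reduce(Y)
    cross = whiten(Xr).T @ whiten(Yr) / (X.shape[0] - 1)      # whitened cross-covariance
    corrs = np.linalg.svd(cross, compute_uv=False)            # canonical correlations
    return float(np.mean(np.clip(corrs, 0.0, 1.0)))

# Example: compare encoder activations for the same sentences rendered in two
# languages, e.g. svcca_similarity(acts_lang_a, acts_lang_b).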