Xia Song


2021

InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training
Zewen Chi | Li Dong | Furu Wei | Nan Yang | Saksham Singhal | Wenhui Wang | Xia Song | Xian-Ling Mao | Heyan Huang | Ming Zhou
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In this work, we present an information-theoretic framework that formulates cross-lingual language model pre-training as maximizing mutual information between multilingual-multi-granularity texts. The unified view helps us to better understand the existing methods for learning cross-lingual representations. More importantly, inspired by the framework, we propose a new pre-training task based on contrastive learning. Specifically, we regard a bilingual sentence pair as two views of the same meaning and encourage their encoded representations to be more similar than the negative examples. By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models. Experimental results on several benchmarks show that our approach achieves considerably better performance. The code and pre-trained models are available at https://aka.ms/infoxlm.
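
To make the contrastive objective above concrete, here is a minimal sketch of an InfoNCE-style loss over sentence embeddings of translation pairs with in-batch negatives. The function name, the temperature value, and the use of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def cross_lingual_contrastive_loss(src_emb, tgt_emb, temperature=0.05):
    # Normalise both sides so the dot product is a cosine similarity.
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature          # [batch, batch] similarity matrix
    targets = torch.arange(src.size(0), device=src.device)
    # Each source sentence's aligned translation (the diagonal) is its positive;
    # the other target sentences in the batch act as negatives.
    return F.cross_entropy(logits, targets)

# Random tensors stand in for encoder outputs of a parallel sentence batch.
loss = cross_lingual_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```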

Language Scaling for Universal Suggested Replies Model
Qianlan Ying | Payal Bajaj | Budhaditya Deb | Yu Yang | Wei Wang | Bojia Lin | Milad Shokouhi | Xia Song | Yang Yang | Daxin Jiang
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

We consider the problem of scaling automated suggested replies for a commercial email application to multiple languages. Faced with increased compute requirements and low language resources for language expansion, we build a single universal model for improving the quality and reducing the run-time costs of our production system. However, restricted data movement across regional centers prevents joint training across languages. To this end, we propose a multilingual multi-task continual learning framework, with auxiliary tasks and language adapters, to train a universal language representation across regions. The experimental results show positive cross-lingual transfer across languages while reducing catastrophic forgetting across regions. Our online results on real user traffic show significant CTR and Char-saved gains as well as a 65% training cost reduction compared with per-language models. As a consequence, we have scaled the feature to multiple languages, including low-resource markets.
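
As a rough illustration of the language-adapter idea mentioned above, the sketch below adds a small per-language bottleneck module on top of a shared hidden representation. The module shape, sizes, and language codes are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class LanguageAdapter(nn.Module):
    """Small bottleneck module trained per language on top of a shared encoder."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states):
        # Residual bottleneck: project down, apply a non-linearity, project back up.
        return self.norm(hidden_states + self.up(torch.relu(self.down(hidden_states))))

# One adapter per language; the shared encoder parameters can stay fixed per region.
adapters = nn.ModuleDict({lang: LanguageAdapter() for lang in ["en", "es", "pt"]})
out = adapters["es"](torch.randn(2, 10, 768))
```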

Consistency Regularization for Cross-Lingual Fine-Tuning
Bo Zheng | Li Dong | Shaohan Huang | Wenhui Wang | Zewen Chi | Saksham Singhal | Wanxiang Che | Ting Liu | Xia Song | Furu Wei
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Fine-tuning pre-trained cross-lingual language models can transfer task-specific supervision from one language to the others. In this work, we propose to improve cross-lingual fine-tuning with consistency regularization. Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations, i.e., subword sampling, Gaussian noise, code-switch substitution, and machine translation. In addition, we employ model consistency to regularize the models trained with two augmented versions of the same training set. Experimental results on the XTREME benchmark show that our method significantly improves cross-lingual fine-tuning across various tasks, including text classification, question answering, and sequence labeling.
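
One plausible instantiation of the example consistency term is a symmetric KL penalty between the model's predictions on an example and on an augmented version of it, sketched below; the exact divergence and weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def example_consistency_loss(logits_orig, logits_aug):
    """Symmetric KL between predictions on an example and on its augmented version."""
    log_p = F.log_softmax(logits_orig, dim=-1)
    log_q = F.log_softmax(logits_aug, dim=-1)
    kl_pq = F.kl_div(log_q, log_p.exp(), reduction="batchmean")   # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q.exp(), reduction="batchmean")   # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# Toy call: logits for a batch of 4 examples over 3 classes, original vs. augmented.
loss = example_consistency_loss(torch.randn(4, 3), torch.randn(4, 3))
```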

mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs
Zewen Chi | Li Dong | Shuming Ma | Shaohan Huang | Saksham Singhal | Xian-Ling Mao | Heyan Huang | Xia Song | Furu Wei
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Multilingual T5 pretrains a sequence-to-sequence model on massive monolingual texts, which has shown promising results on many cross-lingual tasks. In this paper, we improve multilingual text-to-text transfer Transformer with translation pairs (mT6). Specifically, we explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption. In addition, we propose a partially non-autoregressive objective for text-to-text pre-training. We evaluate the methods on seven multilingual benchmark datasets, including sentence classification, named entity recognition, question answering, and abstractive summarization. Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.
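
The sketch below illustrates, in simplified form, how a translation pair could be turned into a span-corruption example by concatenating the two sides and replacing spans with sentinel tokens. Span selection, sentinel format, and tokenization here are illustrative rather than the paper's exact procedure.

```python
def corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input, target)."""
    inp, tgt, sentinel, prev = [], [], 0, 0
    for start, end in spans:
        inp += tokens[prev:start] + [f"<extra_id_{sentinel}>"]
        tgt += [f"<extra_id_{sentinel}>"] + tokens[start:end]
        sentinel, prev = sentinel + 1, end
    inp += tokens[prev:]
    return inp, tgt

# A source sentence and its translation, concatenated with a separator token.
pair = "Thank you very much . </s> Vielen Dank .".split()
inputs, targets = corrupt(pair, [(1, 3), (6, 8)])
# inputs  -> ['Thank', '<extra_id_0>', 'much', '.', '</s>', '<extra_id_1>', '.']
# targets -> ['<extra_id_0>', 'you', 'very', '<extra_id_1>', 'Vielen', 'Dank']
```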

Allocating Large Vocabulary Capacity for Cross-Lingual Language Model Pre-Training
Bo Zheng | Li Dong | Shaohan Huang | Saksham Singhal | Wanxiang Che | Ting Liu | Xia Song | Furu Wei
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity. To this end, we propose an algorithm, VoCap, to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down the pre-training speed. To address this issue, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side-effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pretrained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.
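
As a toy illustration of budgeting vocabulary capacity across languages, the sketch below distributes a fixed number of subword slots using temperature-smoothed corpus sizes as a proxy for demand. The actual VoCap criterion is more involved, so treat this purely as an illustration of the allocation idea; all numbers are made up.

```python
def allocate_vocab(corpus_tokens, total_budget=500_000, alpha=0.7):
    """corpus_tokens: {language: token count}; returns {language: vocab slots}."""
    # Exponent alpha < 1 smooths the distribution so low-resource languages
    # receive proportionally more capacity than raw corpus size would give them.
    weights = {lang: count ** alpha for lang, count in corpus_tokens.items()}
    total = sum(weights.values())
    return {lang: round(total_budget * w / total) for lang, w in weights.items()}

print(allocate_vocab({"en": 3_000_000_000, "sw": 50_000_000, "ur": 20_000_000}))
```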

Multilingual Machine Translation Systems from Microsoft for WMT21 Shared Task
Jian Yang | Shuming Ma | Haoyang Huang | Dongdong Zhang | Li Dong | Shaohan Huang | Alexandre Muzio | Saksham Singhal | Hany Hassan | Xia Song | Furu Wei
Proceedings of the Sixth Conference on Machine Translation

This report describes Microsoft’s machine translation systems for the WMT21 shared task on large-scale multilingual machine translation. We participated in all three evaluation tracks, the Large Track and two Small Tracks, where the former is unconstrained and the latter two are fully constrained. Our submissions to the shared task were initialized with DeltaLM, a generic pre-trained multilingual encoder-decoder model, and fine-tuned on the parallel data collected from the data sources allowed under each track’s settings, applying progressive learning and iterative back-translation to further improve performance. Our final submissions ranked first on all three tracks in terms of the automatic evaluation metric.
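
The data flow of one back-translation round can be sketched as follows; `translate` and `fine_tune` are placeholder stubs standing in for real MT components, and nothing here reflects the actual Microsoft pipeline.

```python
def translate(model, sentences):
    return [f"<{model}:{s}>" for s in sentences]    # stub: pretend translation

def fine_tune(model, parallel_pairs):
    return model + "+ft"                             # stub: pretend training step

def back_translation_round(fwd_model, bwd_model, mono_target, real_parallel):
    # Back-translate target-language monolingual text into the source language,
    # then mix the synthetic pairs with real parallel data to update the forward model.
    synthetic_sources = translate(bwd_model, mono_target)
    synthetic_pairs = list(zip(synthetic_sources, mono_target))
    return fine_tune(fwd_model, real_parallel + synthetic_pairs)

print(back_translation_round("en-de", "de-en", ["Guten Morgen ."], [("Hello .", "Hallo .")]))
```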

2019

Towards Language Agnostic Universal Representations
Armen Aghajanyan | Xia Song | Saurabh Tiwary
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

When a bilingual student learns to solve word problems in math, we expect the student to be able to solve these problems in both languages the student is fluent in, even if the math lessons were only taught in one language. However, current representations in machine learning are language dependent. In this work, we present a method to decouple the language from the problem by learning language agnostic representations, thereby allowing a model to be trained in one language and applied to a different one in a zero-shot fashion. We learn these representations by taking inspiration from linguistics, specifically the Universal Grammar hypothesis, and learn universal latent representations that are language agnostic. We demonstrate the capabilities of these representations by showing that models trained on a single language using language agnostic representations achieve very similar accuracies in other languages.
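
The zero-shot transfer protocol described above can be sketched as follows: a classifier is fit on encoded sentences from one language and applied unchanged to another. The `encode` stub below returns random vectors, so the predictions themselves are meaningless; the snippet only illustrates the evaluation setup, not the paper's encoder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(sentences):
    # Placeholder for a language-agnostic sentence encoder: random vectors only.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), 16))

en_train, en_labels = ["good movie", "terrible plot"], [1, 0]
es_test, es_labels = ["buena película", "trama terrible"], [1, 0]

clf = LogisticRegression().fit(encode(en_train), en_labels)   # train on English only
preds = clf.predict(encode(es_test))                          # apply zero-shot to Spanish
print(preds, es_labels)
```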