Rahul Agrawal
2020
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
Yaobo Liang | Nan Duan | Yeyun Gong | Ning Wu | Fenfei Guo | Weizhen Qi | Ming Gong | Linjun Shou | Daxin Jiang | Guihong Cao | Xiaodong Fan | Ruofei Zhang | Rahul Agrawal | Edward Cui | Sining Wei | Taroon Bharti | Ying Qiao | Jiun-Hung Chen | Winnie Wu | Shuguang Liu | Fan Yang | Daniel Campos | Rangan Majumder | Ming Zhou
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
In this paper, we introduce XGLUE, a new benchmark dataset for training large-scale cross-lingual pre-trained models on multilingual and bilingual corpora and evaluating their performance across a diverse set of cross-lingual tasks. Compared to GLUE (Wang et al., 2019), which is labeled in English and includes natural language understanding tasks only, XGLUE has three main advantages: (1) it provides two corpora of different sizes for cross-lingual pre-training; (2) it provides 11 diversified tasks that cover both natural language understanding and generation scenarios; (3) for each task, it provides labeled data in multiple languages. We extend a recent cross-lingual pre-trained model, Unicoder (Huang et al., 2019), to cover both understanding and generation tasks, and evaluate it on XGLUE as a strong baseline. We also evaluate the base versions (12-layer) of Multilingual BERT, XLM and XLM-R for comparison.