Ning Wu


Lexicon-Enhanced Self-Supervised Training for Multilingual Dense Retrieval
Houxing Ren | Linjun Shou | Jian Pei | Ning Wu | Ming Gong | Daxin Jiang
Findings of the Association for Computational Linguistics: EMNLP 2022

Recent multilingual pre-trained models have shown better performance in various multilingual tasks. However, these models perform poorly on multilingual retrieval tasks owing to a lack of multilingual training data. In this paper, we propose to mine and generate self-supervised training data based on a large-scale unlabeled corpus. We carefully design a mining method that combines sparse and dense models to mine the relevance of unlabeled queries and passages. We also introduce a query generator to generate more queries in target languages for unlabeled passages. Through extensive experiments on the Mr. TYDI dataset and an industrial dataset from a commercial search engine, we demonstrate that our method performs better than baselines based on various pre-trained multilingual models. Our method even achieves on-par performance with the supervised method on the latter dataset.

基于语料的“一+形容词+量词+名词”构式语义考察(A Semantic Study of “One-Adjective-Quantifier-Noun” Based on Corpus)
Ning Wu (吴宁) | Zhimin Wang (王治敏)
Proceedings of the 21st Chinese National Conference on Computational Linguistics


Empowering Dual-Encoder with Query Generator for Cross-Lingual Dense Retrieval
Houxing Ren | Linjun Shou | Ning Wu | Ming Gong | Daxin Jiang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

In monolingual dense retrieval, many works focus on how to distill knowledge from a cross-encoder re-ranker to a dual-encoder retriever, and these methods achieve better performance due to the effectiveness of the cross-encoder re-ranker. However, we find that the performance of the cross-encoder re-ranker is heavily influenced by the number of training samples and the quality of negative samples, both of which are hard to obtain in the cross-lingual setting. In this paper, we propose to use a query generator as the teacher in the cross-lingual setting, which is less dependent on sufficient training samples and high-quality negative samples. In addition to traditional knowledge distillation, we further propose a novel enhancement method, which uses the query generator to help the dual-encoder align queries from different languages without needing any additional parallel sentences. The experimental results show that our method outperforms the state-of-the-art methods on two benchmark datasets.


XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
Yaobo Liang | Nan Duan | Yeyun Gong | Ning Wu | Fenfei Guo | Weizhen Qi | Ming Gong | Linjun Shou | Daxin Jiang | Guihong Cao | Xiaodong Fan | Ruofei Zhang | Rahul Agrawal | Edward Cui | Sining Wei | Taroon Bharti | Ying Qiao | Jiun-Hung Chen | Winnie Wu | Shuguang Liu | Fan Yang | Daniel Campos | Rangan Majumder | Ming Zhou
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In this paper, we introduce XGLUE, a new benchmark dataset to train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora, and evaluate their performance across a diverse set of cross-lingual tasks. Compared to GLUE (Wang et al., 2019), which is labeled in English and includes natural language understanding tasks only, XGLUE has three main advantages: (1) it provides two corpora of different sizes for cross-lingual pre-training; (2) it provides 11 diversified tasks that cover both natural language understanding and generation scenarios; (3) for each task, it provides labeled data in multiple languages. We extend a recent cross-lingual pre-trained model, Unicoder (Huang et al., 2019), to cover both understanding and generation tasks, and evaluate it on XGLUE as a strong baseline. We also evaluate the base versions (12-layer) of Multilingual BERT, XLM and XLM-R for comparison.