Xueguang Ma


2022

pdf bib
An Encoder Attribution Analysis for Dense Passage Retriever in Open-Domain Question Answering
Minghan Li | Xueguang Ma | Jimmy Lin
Proceedings of the 2nd Workshop on Trustworthy Natural Language Processing (TrustNLP 2022)

The bi-encoder design of dense passage retriever (DPR) is a key factor to its success in open-domain question answering (QA), yet it is unclear how DPR’s question encoder and passage encoder individually contributes to overall performance, which we refer to as the encoder attribution problem. The problem is important as it helps us identify the factors that affect individual encoders to further improve overall performance. In this paper, we formulate our analysis under a probabilistic framework called encoder marginalization, where we quantify the contribution of a single encoder by marginalizing other variables. First, we find that the passage encoder contributes more than the question encoder to in-domain retrieval accuracy. Second, we demonstrate how to find the affecting factors for each encoder, where we train DPR with different amounts of data and use encoder marginalization to analyze the results. We find that positive passage overlap and corpus coverage of training data have big impacts on the passage encoder, while the question encoder is mainly affected by training sample complexity under this setting. Based on this framework, we can devise data-efficient training regimes: for example, we manage to train a passage encoder on SQuAD using 60% less training data without loss of accuracy.

2021

pdf
Simple and Effective Unsupervised Redundancy Elimination to Compress Dense Vectors for Passage Retrieval
Xueguang Ma | Minghan Li | Kai Sun | Ji Xin | Jimmy Lin
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Recent work has shown that dense passage retrieval techniques achieve better ranking accuracy in open-domain question answering compared to sparse retrieval techniques such as BM25, but at the cost of large space and memory requirements. In this paper, we analyze the redundancy present in encoded dense vectors and show that the default dimension of 768 is unnecessarily large. To improve space efficiency, we propose a simple unsupervised compression pipeline that consists of principal component analysis (PCA), product quantization, and hybrid search. We further investigate other supervised baselines and find surprisingly that unsupervised PCA outperforms them in some settings. We perform extensive experiments on five question answering datasets and demonstrate that our best pipeline achieves good accuracy–space trade-offs, for example, 48× compression with less than 3% drop in top-100 retrieval accuracy on average or 96× compression with less than 4% drop. Code and data are available at http://pyserini.io/.

pdf
Scientific Claim Verification with VerT5erini
Ronak Pradeep | Xueguang Ma | Rodrigo Nogueira | Jimmy Lin
Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis

This work describes the adaptation of a pretrained sequence-to-sequence model to the task of scientific claim verification in the biomedical domain. We propose a system called VerT5erini that exploits T5 for abstract retrieval, sentence selection, and label prediction, which are three critical sub-tasks of claim verification. We evaluate our pipeline on SciFACT, a newly curated dataset that requires models to not just predict the veracity of claims but also provide relevant sentences from a corpus of scientific literature that support the prediction. Empirically, our system outperforms a strong baseline in each of the three sub-tasks. We further show VerT5erini’s ability to generalize to two new datasets of COVID-19 claims using evidence from the CORD-19 corpus.

pdf
Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval
Xinyu Zhang | Xueguang Ma | Peng Shi | Jimmy Lin
Proceedings of the 1st Workshop on Multilingual Representation Learning

We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages, designed to evaluate ranking with learned dense representations. The goal of this resource is to spur research in dense retrieval techniques in non-English languages, motivated by recent observations that existing techniques for representation learning perform poorly when applied to out-of-distribution data. As a starting point, we provide zero-shot baselines for this new dataset based on a multi-lingual adaptation of DPR that we call “mDPR”. Experiments show that although the effectiveness of mDPR is much lower than BM25, dense representations nevertheless appear to provide valuable relevance signals, improving BM25 results in sparse–dense hybrids. In addition to analyses of our results, we also discuss future challenges and present a research agenda in multi-lingual dense retrieval. Mr. TyDi can be downloaded at https://github.com/castorini/mr.tydi.