Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval

Luyu Gao; Jamie Callan

doi:10.18653/v1/2022.acl-long.203

Abstract

Recent research demonstrates the effectiveness of using fine-tuned language models (LMs) for dense retrieval. However, dense retrievers are hard to train, typically requiring heavily engineered fine-tuning pipelines to realize their full potential. In this paper, we identify and address two underlying problems of dense retrievers: i) fragility to training data noise and ii) the need for large batches to robustly learn the embedding space. We use the recently proposed Condenser pre-training architecture, which learns to condense information into the dense vector through LM pre-training. On top of it, we propose coCondenser, which adds an unsupervised corpus-level contrastive loss to warm up the passage embedding space. Experiments on the MS MARCO, Natural Questions, and TriviaQA datasets show that coCondenser removes the need for heavy data engineering such as augmentation, synthesis, or filtering, as well as the need for large batch training. It shows comparable performance to RocketQA, a state-of-the-art, heavily engineered system, using simple small batch fine-tuning.
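The abstract names a corpus-level contrastive loss but does not give its form. As a rough illustration only, the sketch below assumes a common in-batch formulation: two text spans are sampled from each document, spans from the same document are treated as positives, and every other span in the batch serves as a negative under a softmax over dot-product similarities. The function name `span_contrastive_loss` and the toy vectors are hypothetical and are not taken from the luyug/Condenser code.

```python
import math


def span_contrastive_loss(doc_span_embeddings):
    """Mean in-batch contrastive loss over span embeddings (illustrative sketch).

    doc_span_embeddings: list of documents, each a pair [h_1, h_2] of
    equal-length vectors (the embeddings of two spans from that document).
    Assumption: similarity is a plain dot product.
    """
    # Flatten to (document index, vector) so each span knows its source document.
    spans = [(i, vec) for i, pair in enumerate(doc_span_embeddings) for vec in pair]
    total = 0.0
    for a, (doc_a, h_a) in enumerate(spans):
        scores = []
        positive = None
        for b, (doc_b, h_b) in enumerate(spans):
            if a == b:
                continue  # a span is never compared with itself
            score = sum(x * y for x, y in zip(h_a, h_b))
            scores.append(score)
            if doc_b == doc_a:
                positive = score  # the other span from the same document
        # Numerically stable log-sum-exp over all other spans in the batch.
        top = max(scores)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_denominator - positive
    return total / len(spans)


# Toy batch: spans of the same document point in the same direction.
aligned = [[[5.0, 0.0], [5.0, 0.0]], [[0.0, 5.0], [0.0, 5.0]]]
# Toy batch: spans of the same document are orthogonal.
mismatched = [[[5.0, 0.0], [0.0, 5.0]], [[5.0, 0.0], [0.0, 5.0]]]
print(span_contrastive_loss(aligned) < span_contrastive_loss(mismatched))
```

Under this assumed objective, the loss falls as spans from the same document move closer together relative to spans from other documents, which is one way to read "warming up" the passage embedding space before fine-tuning.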

Anthology ID: 2022.acl-long.203
Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: May
Year: 2022
Address: Dublin, Ireland
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 2843–2853
URL: https://aclanthology.org/2022.acl-long.203
DOI: 10.18653/v1/2022.acl-long.203
Cite (ACL): Luyu Gao and Jamie Callan. 2022. Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2843–2853, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval (Gao & Callan, ACL 2022)
PDF: https://preview.aclanthology.org/naacl24-info/2022.acl-long.203.pdf
Video: https://preview.aclanthology.org/naacl24-info/2022.acl-long.203.mp4
Code: luyug/Condenser
Data: MS MARCO, Natural Questions, TriviaQA
