Domain-matched Pre-training Tasks for Dense Retrieval
Barlas Oguz, Kushal Lakhotia, Anchit Gupta, Patrick Lewis, Vladimir Karpukhin, Aleksandra Piktus, Xilun Chen, Sebastian Riedel, Scott Yih, Sonal Gupta, Yashar Mehdad
Abstract
Pre-training on larger datasets with ever-increasing model size is now a proven recipe for improved performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a pre-existing dataset of Reddit conversations made available by pushshift.io. We evaluate on a set of information retrieval and dialogue retrieval benchmarks, showing substantial improvements over supervised baselines.
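The bi-encoder setup described in the abstract follows the DPR pattern: a query encoder and a passage encoder independently map text to dense vectors, and relevance is scored as the dot product between the two embeddings. Below is a minimal sketch of that scoring scheme in PyTorch; the checkpoint choice (bert-base-uncased), the [CLS] pooling, and the example strings are illustrative assumptions rather than the paper's exact configuration (the released training code lives in facebookresearch/dpr-scale).

```python
# Sketch of DPR-style bi-encoder retrieval scoring (illustrative only).
# Assumptions: bert-base-uncased encoders and [CLS] pooling, which are
# not necessarily the configuration used in the paper.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
query_encoder = AutoModel.from_pretrained("bert-base-uncased")
passage_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    # Tokenize a batch and use the [CLS] vector as the dense representation.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # (batch, hidden)

queries = ["what is dense retrieval"]
passages = [
    "Dense retrieval scores queries and passages with learned embeddings.",
    "Pushshift.io hosts a large archive of Reddit posts and comments.",
]

q = encode(query_encoder, queries)    # (1, hidden)
p = encode(passage_encoder, passages) # (2, hidden)
scores = q @ p.T                      # dot-product relevance scores
print(scores)
```

At training time, such models are typically optimized with a contrastive loss over in-batch negatives, so the two encoders learn to place matching query-passage (or post-comment) pairs close together in embedding space.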
- Anthology ID: 2022.findings-naacl.114
- Volume: Findings of the Association for Computational Linguistics: NAACL 2022
- Month: July
- Year: 2022
- Address: Seattle, United States
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 1524–1534
- URL: https://aclanthology.org/2022.findings-naacl.114
- DOI: 10.18653/v1/2022.findings-naacl.114
- Cite (ACL): Barlas Oguz, Kushal Lakhotia, Anchit Gupta, Patrick Lewis, Vladimir Karpukhin, Aleksandra Piktus, Xilun Chen, Sebastian Riedel, Scott Yih, Sonal Gupta, and Yashar Mehdad. 2022. Domain-matched Pre-training Tasks for Dense Retrieval. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1524–1534, Seattle, United States. Association for Computational Linguistics.
- Cite (Informal): Domain-matched Pre-training Tasks for Dense Retrieval (Oguz et al., Findings 2022)
- PDF: https://preview.aclanthology.org/ingestion-script-update/2022.findings-naacl.114.pdf
- Code: facebookresearch/dpr-scale
- Data: ConvAI2, DSTC7 Task 1, KILT, MS MARCO, Natural Questions, PAQ