Noisy Pair Corrector for Dense Retrieval

Hang Zhang, Yeyun Gong, Xingwei He, Dayiheng Liu, Daya Guo, Jiancheng Lv, Jian Guo


Abstract
Most dense retrieval models rest on an implicit assumption: the training query-document pairs are exactly matched. Because manually annotating a corpus is expensive, training pairs in real-world applications are usually collected automatically, which inevitably introduces mismatched-pair noise. In this paper, we explore an interesting and challenging problem in dense retrieval: how to train an effective model in the presence of mismatched-pair noise. To solve this problem, we propose a novel approach called Noisy Pair Corrector (NPC), which consists of a detection module and a correction module. The detection module estimates noisy pairs by calculating the perplexity between annotated positive and easy negative documents. The correction module uses an exponential moving average (EMA) model to provide a soft supervised signal, which helps mitigate the effects of noise. We conduct experiments on the text-retrieval benchmarks Natural Questions and TriviaQA and the code-search benchmarks StaQC and SO-DS. Experimental results show that NPC achieves excellent performance in handling both synthetic and realistic noise.
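The EMA model mentioned in the abstract can be illustrated with a minimal sketch of the standard exponential-moving-average parameter update. The abstract does not give the update rule, the decay value, or how parameters are stored, so the function name `ema_update`, the dictionary-of-lists parameter layout, and the default `decay` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List


def ema_update(
    teacher: Dict[str, List[float]],
    student: Dict[str, List[float]],
    decay: float = 0.999,  # assumed value; the abstract does not state one
) -> None:
    """Update the teacher in place: teacher <- decay * teacher + (1 - decay) * student."""
    for name, student_values in student.items():
        teacher_values = teacher[name]
        for i, s in enumerate(student_values):
            teacher_values[i] = decay * teacher_values[i] + (1.0 - decay) * s


# Hypothetical toy parameters, used only to show the call.
teacher_params = {"w": [1.0, 0.0]}
student_params = {"w": [0.0, 1.0]}
ema_update(teacher_params, student_params, decay=0.9)
```

With a decay close to 1, the averaged model changes slowly, which is why an EMA copy is often used as a more stable source of soft targets than the model being trained.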
Anthology ID:
2023.findings-emnlp.765
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11439–11451
URL:
https://preview.aclanthology.org/icon-24-ingestion/2023.findings-emnlp.765/
DOI:
10.18653/v1/2023.findings-emnlp.765
Cite (ACL):
Hang Zhang, Yeyun Gong, Xingwei He, Dayiheng Liu, Daya Guo, Jiancheng Lv, and Jian Guo. 2023. Noisy Pair Corrector for Dense Retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11439–11451, Singapore. Association for Computational Linguistics.
Cite (Informal):
Noisy Pair Corrector for Dense Retrieval (Zhang et al., Findings 2023)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2023.findings-emnlp.765.pdf