Abstract
Although existing neural retrieval models show promising results when training data is abundant, and their performance keeps improving as training data increases, collecting high-quality annotated data is prohibitively costly. To this end, we introduce a novel noisy self-training framework combined with synthetic queries, showing that neural retrievers can be improved in a self-evolving manner without relying on any external models. Experimental results show that our method improves consistently over existing methods on both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval benchmarks. Further analysis of low-resource settings reveals that our method is data efficient and outperforms competitive baselines with as little as 30% of the labelled training data. Extending the framework to reranker training further demonstrates that the proposed method is general and yields additional gains on tasks from diverse domains.
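The abstract describes the self-training loop only at a high level. As a reading aid, below is a minimal, self-contained PyTorch sketch of one plausible instantiation. Everything in it is a stand-in assumption: the toy `HashEncoder`, the tiny corpus, the pre-generated synthetic queries, word-deletion noise, and the top-1 pseudo-labelling rule are illustrative choices, not the paper's actual architecture, query generator, or noise scheme.

```python
"""Toy sketch of noisy self-training for a dense retriever.

All components here (HashEncoder, corpus, synthetic queries) are
hypothetical stand-ins so the script runs on its own; they are not
the paper's setup.
"""
import random
import torch
import torch.nn as nn
import torch.nn.functional as F


class HashEncoder(nn.Module):
    """Toy dual encoder: hashed bag-of-words -> mean-pooled embedding."""

    def __init__(self, buckets=1024, dim=64):
        super().__init__()
        self.emb = nn.EmbeddingBag(buckets, dim)  # default mode: mean
        self.buckets = buckets

    def forward(self, texts):
        ids, offsets = [], []
        for t in texts:
            offsets.append(len(ids))
            ids.extend(hash(w) % self.buckets for w in t.lower().split())
        out = self.emb(torch.tensor(ids), torch.tensor(offsets))
        return F.normalize(out, dim=-1)


def drop_words(text, p=0.1):
    """Input noise on the student side: random word deletion."""
    kept = [w for w in text.split() if random.random() > p]
    return " ".join(kept) if kept else text


def pseudo_label(model, queries, corpus):
    """Teacher step: the current model labels each synthetic query
    with its own top-1 retrieved passage (no external teacher)."""
    with torch.no_grad():
        q, d = model(queries), model(corpus)
        top1 = (q @ d.T).argmax(dim=1)
    return [corpus[int(i)] for i in top1]


def train_round(model, opt, queries, positives, noise_p=0.1):
    """Student step: contrastive loss with in-batch negatives,
    computed on noised query and passage inputs."""
    q = model([drop_words(x, noise_p) for x in queries])
    d = model([drop_words(x, noise_p) for x in positives])
    logits = q @ d.T / 0.05  # temperature-scaled similarities
    loss = F.cross_entropy(logits, torch.arange(len(queries)))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


corpus = ["neural retrieval models", "annotated data is costly",
          "self-training with synthetic queries", "reranker training"]
synthetic_queries = ["how do retrievers learn", "cost of labelled data",
                     "training without labels", "improving rerankers"]

model = HashEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for round_ in range(3):  # iterate: pseudo-label, then retrain
    positives = pseudo_label(model, synthetic_queries, corpus)
    loss = train_round(model, opt, synthetic_queries, positives)
    print(f"round {round_}: loss={loss:.3f}")
```

The loop captures only the shape of the iteration the abstract names: the model pseudo-labels synthetic queries with its own retrievals, the student retrains on noised versions of that data, and the two steps alternate. The paper's full method (including how synthetic queries are generated and how noise is injected) should be taken from the PDF linked below, not from this sketch.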
- Anthology ID:
- 2023.findings-emnlp.803
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2023
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Houda Bouamor, Juan Pino, Kalika Bali
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 11991–12008
- URL:
- https://aclanthology.org/2023.findings-emnlp.803
- DOI:
- 10.18653/v1/2023.findings-emnlp.803
- Cite (ACL):
- Fan Jiang, Tom Drummond, and Trevor Cohn. 2023. Noisy Self-Training with Synthetic Queries for Dense Retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11991–12008, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- Noisy Self-Training with Synthetic Queries for Dense Retrieval (Jiang et al., Findings 2023)
- PDF:
- https://aclanthology.org/2023.findings-emnlp.803.pdf