Abstract
Researchers have proposed simple yet effective techniques for the retrieval problem based on using BERT as a relevance classifier to rerank initial candidates from keyword search. In this work, we tackle the challenge of fine-tuning these models for specific domains in a data and computationally efficient manner. Typically, researchers fine-tune models using corpus-specific labeled data from sources such as TREC. We first answer the question: How much data of this type do we need? Recognizing that the most computationally efficient training is no training, we explore zero-shot ranking using BERT models that have already been fine-tuned with the large MS MARCO passage retrieval dataset. We arrive at the surprising and novel finding that “some” labeled in-domain data can be worse than none at all.

- Anthology ID: 2020.sustainlp-1.14
- Volume: Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing
- Month: November
- Year: 2020
- Address: Online
- Editors: Nafise Sadat Moosavi, Angela Fan, Vered Shwartz, Goran Glavaš, Shafiq Joty, Alex Wang, Thomas Wolf
- Venue: sustainlp
- Publisher: Association for Computational Linguistics
- Pages: 107–112
- URL: https://aclanthology.org/2020.sustainlp-1.14
- DOI: 10.18653/v1/2020.sustainlp-1.14
- Cite (ACL): Xinyu Zhang, Andrew Yates, and Jimmy Lin. 2020. A Little Bit Is Worse Than None: Ranking with Limited Training Data. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 107–112, Online. Association for Computational Linguistics.
- Cite (Informal): A Little Bit Is Worse Than None: Ranking with Limited Training Data (Zhang et al., sustainlp 2020)
- PDF: https://preview.aclanthology.org/ingest-bitext-workshop/2020.sustainlp-1.14.pdf
- Data: MS MARCO