Abstract
Commonly used information retrieval methods in open-domain question answering (QA) systems, such as TF-IDF, are insufficient to capture deep semantic matching that goes beyond lexical overlap. Some recent studies treat the retrieval process as maximum inner product search (MIPS) over dense question and paragraph representations, achieving promising results on several information-seeking QA datasets. However, pretraining the dense vector representations is highly resource-demanding: it requires, for example, a very large batch size and many training steps. In this work, we propose a sample-efficient method to pretrain the paragraph encoder. First, instead of using heuristically created pseudo question-paragraph pairs for pretraining, we use an existing pretrained sequence-to-sequence model to build a strong question generator that creates high-quality pretraining data. Second, we propose a simple progressive pretraining algorithm to ensure the existence of effective negative samples in each batch. Across three open-domain QA datasets, our method consistently outperforms a strong dense retrieval baseline that uses 6 times more computation for training. On two of the datasets, our method achieves an absolute improvement of more than 4 points in answer exact match.
- Anthology ID: 2021.eacl-main.244
- Volume: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
- Month: April
- Year: 2021
- Address: Online
- Editors: Paola Merlo, Jörg Tiedemann, Reut Tsarfaty
- Venue: EACL
- Publisher: Association for Computational Linguistics
- Pages: 2803–2815
- URL: https://aclanthology.org/2021.eacl-main.244
- DOI: 10.18653/v1/2021.eacl-main.244
- Cite (ACL): Wenhan Xiong, Hong Wang, and William Yang Wang. 2021. Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2803–2815, Online. Association for Computational Linguistics.
- Cite (Informal): Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering (Xiong et al., EACL 2021)
- PDF: https://preview.aclanthology.org/nschneid-patch-1/2021.eacl-main.244.pdf
- Code: xwhan/ProQA
- Data: Natural Questions, WebQuestions