Latent Retrieval for Weakly Supervised Open Domain Question Answering

Kenton Lee, Ming-Wei Chang, Kristina Toutanova


Abstract
Recent work on open domain question answering (QA) assumes strong supervision of the supporting evidence and/or assumes a blackbox information retrieval (IR) system to retrieve evidence candidates. We argue that both are suboptimal, since gold evidence is not always available, and QA is fundamentally different from IR. We show for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs and without any IR system. In this setting, evidence retrieval from all of Wikipedia is treated as a latent variable. Since this is impractical to learn from scratch, we pre-train the retriever with an Inverse Cloze Task. We evaluate on open versions of five QA datasets. On datasets where the questioner already knows the answer, a traditional IR system such as BM25 is sufficient. On datasets where a user is genuinely seeking an answer, we show that learned retrieval is crucial, outperforming BM25 by up to 19 points in exact match.
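The abstract's key idea is that the retriever can be pre-trained without question-answer supervision via an Inverse Cloze Task: a sentence acts as a pseudo-query and the passage it was taken from acts as the positive context, scored by a dot product between two encoders. Below is a minimal sketch of that objective with in-batch negatives; the encoders are abstracted away as plain embedding matrices and all names are illustrative, not taken from the paper's released code.

```python
# Sketch of the Inverse Cloze Task (ICT) retrieval pre-training objective:
# each sentence (pseudo-query) should score its source passage higher than
# the other passages in the batch. Encoders are assumed to be BERT-style
# models producing fixed-size embeddings; here they are just arrays.
import numpy as np

def ict_loss(sentence_embs: np.ndarray, context_embs: np.ndarray) -> float:
    """Softmax cross-entropy over in-batch candidate contexts.

    sentence_embs: [batch, dim] embeddings of sentences (pseudo-queries).
    context_embs:  [batch, dim] embeddings of the passages they were drawn
                   from, aligned by row (row i is the positive for row i).
    """
    scores = sentence_embs @ context_embs.T            # [batch, batch] dot-product retrieval scores
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The correct context for each sentence lies on the diagonal.
    return float(-np.mean(np.diag(log_probs)))
```

At fine-tuning time the same dot-product retrieval score is combined with a reader score over retrieved passages, and retrieval is treated as a latent variable marginalized during training, as the abstract describes.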
Anthology ID:
P19-1612
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
Publisher:
Association for Computational Linguistics
Note:
Pages:
6086–6096
URL:
https://aclanthology.org/P19-1612
DOI:
10.18653/v1/P19-1612
Cite (ACL):
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Latent Retrieval for Weakly Supervised Open Domain Question Answering (Lee et al., ACL 2019)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/P19-1612.pdf
Code
google-research/language (plus additional community code)
Data
Natural Questions, SQuAD, TriviaQA, WebQuestions