A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, Matt Gardner


Abstract
Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing information-seeking question answering datasets usually contain questions about generic factoid-type information. We therefore present Qasper, a dataset of 5049 questions over 1585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers, motivating further research in document-grounded, information-seeking QA, which our dataset is designed to facilitate.
Anthology ID:
2021.naacl-main.365
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
4599–4610
Language:
URL:
https://aclanthology.org/2021.naacl-main.365
DOI:
10.18653/v1/2021.naacl-main.365
Bibkey:
Cite (ACL):
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. 2021. A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610, Online. Association for Computational Linguistics.
Cite (Informal):
A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers (Dasigi et al., NAACL 2021)
Copy Citation:
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2021.naacl-main.365.pdf
Video:
 https://preview.aclanthology.org/naacl-24-ws-corrections/2021.naacl-main.365.mp4
Code
 allenai/qasper-led-baseline
Data
QASPERS2ORC