Harvesting Paragraph-level Question-Answer Pairs from Wikipedia

Xinya Du, Claire Cardie


Abstract
We study the task of generating from Wikipedia articles question-answer pairs that cover content beyond a single sentence. We propose a neural network approach that incorporates coreference knowledge via a novel gating mechanism. As compared to models that only take into account sentence-level information (Heilman and Smith, 2010; Du et al., 2017; Zhou et al., 2017), we find that the linguistic knowledge introduced by the coreference representation aids question generation significantly, producing models that outperform the current state-of-the-art. We apply our system (composed of an answer span extraction system and the passage-level QG system) to the 10,000 top-ranking Wikipedia articles and create a corpus of over one million question-answer pairs. We provide qualitative analysis for this large-scale generated corpus from Wikipedia.
Anthology ID:
P18-1177
Volume:
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2018
Address:
Melbourne, Australia
Editors:
Iryna Gurevych, Yusuke Miyao
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1907–1917
URL:
https://aclanthology.org/P18-1177
DOI:
10.18653/v1/P18-1177
Cite (ACL):
Xinya Du and Claire Cardie. 2018. Harvesting Paragraph-level Question-Answer Pairs from Wikipedia. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1907–1917, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Harvesting Paragraph-level Question-Answer Pairs from Wikipedia (Du & Cardie, ACL 2018)
PDF:
https://preview.aclanthology.org/add_acl24_videos/P18-1177.pdf
Note:
 P18-1177.Notes.pdf
Video:
 https://preview.aclanthology.org/add_acl24_videos/P18-1177.mp4
Code
 xinyadu/harvestingQA
Data
SQuAD, SimpleQuestions, WebQuestions