emrQA: A Large Corpus for Question Answering on Electronic Medical Records

Anusri Pampari, Preethi Raghavan, Jennifer Liang, Jian Peng


Abstract
We propose a novel methodology to generate domain-specific large-scale question answering (QA) datasets by re-purposing existing annotations for other NLP tasks. We demonstrate an instance of this methodology in generating a large-scale QA dataset for electronic medical records by leveraging existing expert annotations on clinical notes for various NLP tasks from the community-shared i2b2 datasets. The resulting corpus (emrQA) has 1 million question-logical form pairs and 400,000+ question-answer evidence pairs. We characterize the dataset and explore its learning potential by training baseline models for question-to-logical-form and question-to-answer mapping.
Anthology ID:
D18-1258
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2357–2368
URL:
https://aclanthology.org/D18-1258
DOI:
10.18653/v1/D18-1258
Cite (ACL):
Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA: A Large Corpus for Question Answering on Electronic Medical Records. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2357–2368, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
emrQA: A Large Corpus for Question Answering on Electronic Medical Records (Pampari et al., EMNLP 2018)
PDF:
https://preview.aclanthology.org/ingestion-script-update/D18-1258.pdf
Video:
https://vimeo.com/305887077
Code:
panushri25/emrQA
Data:
emrQA, DBpedia, SQuAD