Abstract
We propose a novel methodology to generate domain-specific large-scale question answering (QA) datasets by re-purposing existing annotations for other NLP tasks. We demonstrate an instance of this methodology in generating a large-scale QA dataset for electronic medical records by leveraging existing expert annotations on clinical notes for various NLP tasks from the community-shared i2b2 datasets. The resulting corpus (emrQA) has 1 million question-logical form pairs and 400,000+ question-answer evidence pairs. We characterize the dataset and explore its learning potential by training baseline models for question-to-logical-form and question-to-answer mapping.
- Anthology ID:
- D18-1258
- Volume:
- Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
- Month:
- October-November
- Year:
- 2018
- Address:
- Brussels, Belgium
- Venue:
- EMNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2357–2368
- URL:
- https://aclanthology.org/D18-1258
- DOI:
- 10.18653/v1/D18-1258
- Cite (ACL):
- Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA: A Large Corpus for Question Answering on Electronic Medical Records. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2357–2368, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal):
- emrQA: A Large Corpus for Question Answering on Electronic Medical Records (Pampari et al., EMNLP 2018)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/D18-1258.pdf
- Code:
- panushri25/emrQA
- Data:
- emrQA, DBpedia, SQuAD