Abstract
One of the biggest challenges prohibiting the use of many current NLP methods in clinical settings is the lack of publicly available datasets. In this work, we present MeDAL, a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. We pre-trained several models of common architectures on this dataset and empirically showed that such pre-training leads to improved performance and convergence speed when fine-tuning on downstream medical tasks.
- Anthology ID: 2020.clinicalnlp-1.15
- Volume: Proceedings of the 3rd Clinical Natural Language Processing Workshop
- Month: November
- Year: 2020
- Address: Online
- Venue: ClinicalNLP
- Publisher: Association for Computational Linguistics
- Pages: 130–135
- URL: https://aclanthology.org/2020.clinicalnlp-1.15
- DOI: 10.18653/v1/2020.clinicalnlp-1.15
- Cite (ACL): Zhi Wen, Xing Han Lu, and Siva Reddy. 2020. MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, pages 130–135, Online. Association for Computational Linguistics.
- Cite (Informal): MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining (Wen et al., ClinicalNLP 2020)
- PDF: https://preview.aclanthology.org/nodalida-main-page/2020.clinicalnlp-1.15.pdf
- Code: mcGill-NLP/medal
- Data: MeDAL, ADAM, MIMIC-III, PubMed
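The abbreviation-disambiguation task the abstract describes can be illustrated with a toy sketch. The candidate expansions, the example sentence, and the bag-of-words overlap heuristic below are all illustrative assumptions — the paper itself pre-trains neural models on MeDAL rather than using any rule-based scorer:

```python
# Toy illustration of medical abbreviation disambiguation:
# given a context containing an abbreviation and a set of candidate
# expansions, pick the expansion whose description best matches the context.
# (Heuristic and data are illustrative; not the MeDAL paper's method.)

def disambiguate(context: str, candidates: dict) -> str:
    """Return the expansion whose description shares the most words with the context."""
    ctx_words = set(context.lower().split())

    def overlap(expansion: str) -> int:
        # Count shared words between the context and the candidate's description.
        return len(ctx_words & set(candidates[expansion].lower().split()))

    return max(candidates, key=overlap)

# Hypothetical example: "RA" could mean "right atrium" or "rheumatoid arthritis".
candidates = {
    "right atrium": "heart chamber atrium cardiac blood",
    "rheumatoid arthritis": "autoimmune joint inflammation arthritis disease",
}
context = "echocardiography showed an enlarged RA and reduced cardiac output"
print(disambiguate(context, candidates))  # prints "right atrium"
```

In MeDAL-style pre-training, a model learns this mapping from context alone at much larger scale, which is what transfers to downstream clinical tasks.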