What Clued the AI Doctor In? On the Influence of Data Source and Quality for Transformer-Based Medical Self-Disclosure Detection
Mina Valizadeh, Xing Qian, Pardis Ranjbar-Noiey, Cornelia Caragea, Natalie Parde
Abstract
Recognizing medical self-disclosure is important in many healthcare contexts, but it has been under-explored by the NLP community. We conduct a three-pronged investigation of this task. We (1) manually expand and refine the only existing medical self-disclosure corpus, resulting in a new, publicly available dataset of 3,919 social media posts with clinically validated labels and high compatibility with the existing task-specific protocol. We also (2) study the merits of pretraining task domain and text style by comparing Transformer-based models for this task, pretrained from general, medical, and social media sources. Our BERTweet condition outperforms the existing state of the art for this task by a relative F1 score increase of 16.73%. Finally, we (3) compare data augmentation techniques for this task, to assess the extent to which medical self-disclosure data may be further synthetically expanded. We discover that this task poses many challenges for data augmentation techniques, and we provide an in-depth analysis of identified trends.- Anthology ID:
- 2023.eacl-main.86
- Volume:
- Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
- Month:
- May
- Year:
- 2023
- Address:
- Dubrovnik, Croatia
- Venue:
- EACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 1201–1216
- Language:
- URL:
- https://aclanthology.org/2023.eacl-main.86
- DOI:
- Cite (ACL):
- Mina Valizadeh, Xing Qian, Pardis Ranjbar-Noiey, Cornelia Caragea, and Natalie Parde. 2023. What Clued the AI Doctor In? On the Influence of Data Source and Quality for Transformer-Based Medical Self-Disclosure Detection. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1201–1216, Dubrovnik, Croatia. Association for Computational Linguistics.
- Cite (Informal):
- What Clued the AI Doctor In? On the Influence of Data Source and Quality for Transformer-Based Medical Self-Disclosure Detection (Valizadeh et al., EACL 2023)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2023.eacl-main.86.pdf