Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering

Cuong Ha, Shima Asaadi, Sanjeev Kumar Karn, Oladimeji Farri, Tobias Heimann, Thomas Runkler


Abstract
Vision-language models, while effective in general domains and strong across diverse multi-modal applications such as visual question answering (VQA), struggle to maintain the same level of effectiveness in more specialized domains such as medicine. We propose a medical vision-language model that integrates large vision and language models adapted for the medical domain. The model is trained in three stages of parameter-efficient tuning on three separate biomedical and radiology multi-modal vision and text datasets. The proposed model achieves state-of-the-art performance on the SLAKE 1.0 medical VQA (MedVQA) dataset with an overall accuracy of 87.5%, and demonstrates strong performance on another MedVQA dataset, VQA-RAD, achieving an overall accuracy of 73.2%.
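
The abstract does not include an architecture diagram or code, but the fusion it describes follows the common pattern of projecting features from a frozen, domain-adapted vision encoder into the embedding space of a domain-adapted language model and training only lightweight components (a projection layer plus low-rank adapters). The sketch below illustrates that general pattern in PyTorch; the module names, dimensions, the single-block language-model stand-in, and the LoRA-style adapters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class VisionLanguageFusion(nn.Module):
    """Projects frozen vision features into the language-model embedding space
    and prepends them to the text token embeddings as a soft visual prefix."""

    def __init__(self, vision_dim=768, lm_dim=2048, vocab_size=32000):
        super().__init__()
        self.projector = nn.Linear(vision_dim, lm_dim)          # trainable connector
        self.token_emb = nn.Embedding(vocab_size, lm_dim)
        self.lm_block = LoRALinear(nn.Linear(lm_dim, lm_dim))   # stand-in LM layer + LoRA
        self.lm_head = nn.Linear(lm_dim, vocab_size)
        # Treat embeddings and output head as part of the pretrained LM: frozen.
        for p in list(self.token_emb.parameters()) + list(self.lm_head.parameters()):
            p.requires_grad = False

    def forward(self, vision_feats, input_ids):
        visual_prefix = self.projector(vision_feats)            # (B, V, lm_dim)
        text_emb = self.token_emb(input_ids)                    # (B, T, lm_dim)
        fused = torch.cat([visual_prefix, text_emb], dim=1)     # (B, V+T, lm_dim)
        return self.lm_head(self.lm_block(fused))               # next-token logits


if __name__ == "__main__":
    model = VisionLanguageFusion()
    feats = torch.randn(2, 49, 768)          # e.g. 7x7 patch features from a ViT encoder
    ids = torch.randint(0, 32000, (2, 16))   # tokenized question
    logits = model(feats, ids)               # (2, 49 + 16, 32000)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(logits.shape, trainable)           # only projector + LoRA params are trainable
```

In the full model, the stand-ins above would presumably be replaced by complete domain-adapted vision and language backbones, with the three parameter-efficient training stages drawing on the biomedical and radiology multi-modal datasets mentioned in the abstract.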
Anthology ID: 2024.clinicalnlp-1.21
Volume: Proceedings of the 6th Clinical Natural Language Processing Workshop
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Tristan Naumann, Asma Ben Abacha, Steven Bethard, Kirk Roberts, Danielle Bitterman
Venues: ClinicalNLP | WS
Publisher: Association for Computational Linguistics
Pages: 246–257
URL: https://aclanthology.org/2024.clinicalnlp-1.21
Cite (ACL): Cuong Ha, Shima Asaadi, Sanjeev Kumar Karn, Oladimeji Farri, Tobias Heimann, and Thomas Runkler. 2024. Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering. In Proceedings of the 6th Clinical Natural Language Processing Workshop, pages 246–257, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering (Ha et al., ClinicalNLP-WS 2024)
PDF: https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.clinicalnlp-1.21.pdf