MediVLM: A Vision Language Model for Radiology Report Generation from Medical Images

Debanjan Goswami, Ronast Subedi, Shayok Chakraborty


Abstract
Generating radiology reports from medical images has garnered considerable attention in the research community. While existing methods have demonstrated promise, they often generate reports that are factually incomplete and inconsistent, fail to focus on informative regions within an image, and impose strong annotation assumptions, such as bounding box or image-level annotations (which can be challenging to obtain), for model training. In this paper, we propose MediVLM, a vision language model (VLM) for radiology report generation from medical images. The proposed model consists of a pre-trained object detector to extract the salient anatomical regions from the images, an image encoder, a text encoder, a module to align the visual and text representations, a cross-attention layer to fuse the two representations, and finally, a transformer-based decoder to generate the final report. MediVLM can generate radiology reports even when no reports are available for training; this is an extremely useful feature, as curating such reports is a labor-intensive task. Further, it computes a severity score (depicting the seriousness of a patient's medical condition) from the generated radiology reports, which can be used to prioritize patients who need immediate medical attention. Our extensive empirical analyses on three benchmark datasets corroborate the promise and potential of our method against competing baselines. Our code is open-sourced on our project webpage at: https://sites.google.com/view/medivlm/home
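The abstract describes fusing text representations with detected anatomical region features via a cross-attention layer before decoding. As an illustrative sketch only (not the paper's implementation; the feature dimensions, region count, and function names here are hypothetical), scaled dot-product cross-attention where text tokens attend over image-region features can be written as:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Fuse text features with region features.

    queries: text-token features, shape (T, d)
    keys/values: detected-region features, shape (R, d)
    Returns fused text features, shape (T, d).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (T, R) similarity
    weights = softmax(scores, axis=-1)       # each token attends over regions
    return weights @ values                  # region-aware token features

# Hypothetical sizes: 5 text tokens, 8 detected anatomical regions, dim 64
rng = np.random.default_rng(0)
text_feats = rng.standard_normal((5, 64))
region_feats = rng.standard_normal((8, 64))
fused = cross_attention(text_feats, region_feats, region_feats)
print(fused.shape)  # (5, 64)
```

In a full model of this kind, the fused features would then be passed to a transformer-based decoder to generate the report; the sketch above only shows the fusion step.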
Anthology ID:
2025.findings-emnlp.544
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10287–10304
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.544/
DOI:
10.18653/v1/2025.findings-emnlp.544
Cite (ACL):
Debanjan Goswami, Ronast Subedi, and Shayok Chakraborty. 2025. MediVLM: A Vision Language Model for Radiology Report Generation from Medical Images. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 10287–10304, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
MediVLM: A Vision Language Model for Radiology Report Generation from Medical Images (Goswami et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.544.pdf
Checklist:
2025.findings-emnlp.544.checklist.pdf