SERPENT-VLM : Self-Refining Radiology Report Generation Using Vision Language Models

Manav Kapadnis, Sohan Patnaik, Abhilash Nandy, Sourjyadip Ray, Pawan Goyal, Debdoot Sheet


Abstract
Radiology Report Generation (R2Gen) demonstrates how Multi-modal Large Language Models (MLLMs) can automate the creation of accurate and coherent radiological reports. Existing methods often hallucinate details in text-based reports that do not accurately reflect the image content. To mitigate this, we introduce a novel strategy, SERPENT-VLM (SElf Refining Radiology RePort GENeraTion using Vision Language Models), which improves the R2Gen task by integrating a self-refining mechanism into the MLLM framework. We employ a unique self-supervised loss that leverages similarity between the pooled image representations and the contextual representations of the generated radiological text, alongside the standard Causal Language Modeling objective, to refine image-text representations. This allows the model to scrutinize and align the generated text through dynamic interaction between a given image and the generated text, thereby reducing hallucination and continuously enhancing nuanced report generation. SERPENT-VLM outperforms existing baselines such as LLaVA-Med and BiomedGPT, achieving state-of-the-art (SoTA) performance on the IU X-ray and Radiology Objects in COntext (ROCO) datasets, and also proves robust against noisy images. A qualitative case study highlights the significant advances towards more sophisticated MLLM frameworks for R2Gen, opening paths for further research into self-supervised refinement in the medical imaging domain.
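As a rough illustration of the objective described in the abstract, the sketch below combines a standard causal language modeling loss with a self-supervised alignment term computed from a pooled image representation and a pooled contextual representation of the generated report. The cosine-similarity form of the alignment term, the weighting factor alpha, and all function and variable names are assumptions for illustration only, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def self_refining_loss(lm_loss: torch.Tensor,
                       pooled_image: torch.Tensor,   # (B, D) pooled visual features
                       pooled_text: torch.Tensor,    # (B, D) pooled contextual features of the generated report
                       alpha: float = 0.5) -> torch.Tensor:
    # Alignment term: 1 - cosine similarity, averaged over the batch.
    # Generated reports whose representations drift from the image content
    # incur a higher penalty, which is the self-refining signal described above.
    align_loss = (1.0 - F.cosine_similarity(pooled_image, pooled_text, dim=-1)).mean()
    # Total objective: causal LM loss plus the weighted self-supervised term.
    # alpha is a hypothetical hyperparameter, not a value reported in the paper.
    return lm_loss + alpha * align_loss

# Usage with placeholder tensors standing in for real model outputs:
batch_size, hidden_dim = 4, 768
lm_loss = torch.tensor(2.3)                         # from the MLLM's causal LM head
pooled_image = torch.randn(batch_size, hidden_dim)  # e.g. mean-pooled vision-encoder patch features
pooled_text = torch.randn(batch_size, hidden_dim)   # e.g. pooled decoder hidden states of the generated report
total_loss = self_refining_loss(lm_loss, pooled_image, pooled_text)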
Anthology ID:
2024.clinicalnlp-1.24
Volume:
Proceedings of the 6th Clinical Natural Language Processing Workshop
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Tristan Naumann, Asma Ben Abacha, Steven Bethard, Kirk Roberts, Danielle Bitterman
Venues:
ClinicalNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
283–291
URL:
https://aclanthology.org/2024.clinicalnlp-1.24
Cite (ACL):
Manav Kapadnis, Sohan Patnaik, Abhilash Nandy, Sourjyadip Ray, Pawan Goyal, and Debdoot Sheet. 2024. SERPENT-VLM : Self-Refining Radiology Report Generation Using Vision Language Models. In Proceedings of the 6th Clinical Natural Language Processing Workshop, pages 283–291, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
SERPENT-VLM : Self-Refining Radiology Report Generation Using Vision Language Models (Kapadnis et al., ClinicalNLP-WS 2024)
PDF:
https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.clinicalnlp-1.24.pdf