Enhancing Image-to-Text Generation in Radiology Reports through Cross-modal Multi-Task Learning

Nurbanu Aksoy, Nishant Ravikumar, Serge Sharoff


Abstract
Image-to-text generation involves automatically producing descriptive text from images and has applications in medical report generation. However, traditional approaches often exhibit a semantic gap between visual and textual information. In this paper, we propose a multi-task learning framework that leverages both visual and non-imaging data for generating radiology reports. Along with chest X-ray images, 10 additional features comprising numeric, binary, categorical, and text data were incorporated to create a unified representation. The model was trained to generate text, predict the degree of patient severity, and identify medical findings. Multi-task learning, especially when text generation was prioritised, improved performance over single-task baselines across language generation metrics. The framework also mitigated overfitting in auxiliary tasks compared to single-task models. Qualitative analysis showed logically coherent narratives and accurate identification of findings, though some repetition and disjointed phrasing remained. This work demonstrates the benefits of multi-modal, multi-task learning for image-to-text generation applications.
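The training objective described in the abstract (text generation plus two auxiliary prediction tasks, with the generation task prioritised) can be sketched as a weighted multi-task loss. The weights, head dimensions, and loss choices below are illustrative assumptions, not values taken from the paper:

```python
import torch
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    """Weighted sum of three task losses: report generation (token-level
    cross-entropy), severity prediction (multi-class), and findings
    identification (multi-label). A larger weight on the text loss
    reflects the paper's text-generation prioritisation; the actual
    weights are assumptions for this sketch."""

    def __init__(self, w_text=1.0, w_severity=0.5, w_findings=0.5):
        super().__init__()
        self.w_text = w_text
        self.w_severity = w_severity
        self.w_findings = w_findings
        self.text_loss = nn.CrossEntropyLoss(ignore_index=0)  # 0 = padding token
        self.severity_loss = nn.CrossEntropyLoss()            # severity as classes
        self.findings_loss = nn.BCEWithLogitsLoss()           # multi-label findings

    def forward(self, text_logits, text_targets,
                sev_logits, sev_targets,
                find_logits, find_targets):
        # text_logits: (batch, seq_len, vocab); flatten to token level
        l_text = self.text_loss(text_logits.flatten(0, 1), text_targets.flatten())
        l_sev = self.severity_loss(sev_logits, sev_targets)
        l_find = self.findings_loss(find_logits, find_targets)
        return (self.w_text * l_text
                + self.w_severity * l_sev
                + self.w_findings * l_find)
```

In practice the three heads would share a fused image-plus-clinical-features encoder; the combined scalar loss is backpropagated through all heads at once.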
Anthology ID:
2024.lrec-main.529
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
5977–5985
URL:
https://aclanthology.org/2024.lrec-main.529
Cite (ACL):
Nurbanu Aksoy, Nishant Ravikumar, and Serge Sharoff. 2024. Enhancing Image-to-Text Generation in Radiology Reports through Cross-modal Multi-Task Learning. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 5977–5985, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Enhancing Image-to-Text Generation in Radiology Reports through Cross-modal Multi-Task Learning (Aksoy et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/add_acl24_videos/2024.lrec-main.529.pdf