Hongxing Wang
2024
MLeVLM: Improve Multi-level Progressive Capabilities based on Multimodal Large Language Model for Medical Visual Question Answering
Dexuan Xu | Yanyuan Chen | Jieyi Wang | Yue Huang | Hanpin Wang | Zhi Jin | Hongxing Wang | Weihua Yue | Jing He | Hang Li | Yu Huang
Findings of the Association for Computational Linguistics: ACL 2024
Medical visual question answering (MVQA) requires in-depth understanding of medical images and questions to provide reliable answers. We summarize the multi-level progressive capabilities that models need for MVQA: recognition, details, diagnosis, knowledge, and reasoning. Existing MVQA models tend to overlook these capabilities because of nonspecific training data and plain architectures. To address these issues, this paper proposes the Multi-level Visual Language Model (MLeVLM) for MVQA. On the data side, we construct a high-quality multi-level instruction dataset, MLe-VQA, via GPT-4, which covers multi-level questions and answers as well as reasoning processes from visual clues to semantic cognition. On the architecture side, we propose a multi-level feature alignment module, including an attention-based token selector and a context merger, which can efficiently align features at different levels, from visual to semantic. To better evaluate the model's capabilities, we manually construct a multi-level MVQA evaluation benchmark named MLe-Bench. Extensive experiments demonstrate the effectiveness of our multi-level instruction dataset and the multi-level feature alignment module, and show that MLeVLM outperforms existing medical multimodal large language models.
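The abstract describes an attention-based token selector and a context merger for aligning visual features. As a rough illustration only, the sketch below shows one plausible way such a module could look in PyTorch; the module name `MultiLevelAligner`, the hyperparameters (`d_model`, `top_k`), and the merging strategy are assumptions for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn


class MultiLevelAligner(nn.Module):
    """Hypothetical sketch of an attention-based token selector plus a context
    merger. Names and hyperparameters are assumptions, not the authors' design."""

    def __init__(self, d_model: int = 768, top_k: int = 64):
        super().__init__()
        self.top_k = top_k
        self.score = nn.Linear(d_model, 1)        # per-token relevance score
        self.merge = nn.Linear(d_model, d_model)  # projects the merged context token

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, d_model), e.g. ViT patch features
        scores = self.score(visual_tokens).squeeze(-1)      # (B, N)
        weights = scores.softmax(dim=-1)                    # attention over tokens
        top_idx = weights.topk(self.top_k, dim=-1).indices  # keep most salient tokens
        idx = top_idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
        selected = visual_tokens.gather(1, idx)             # (B, top_k, D)
        # Merge all tokens into one context vector, weighted by the attention scores.
        context = (weights.unsqueeze(-1) * visual_tokens).sum(1, keepdim=True)
        context = self.merge(context)                       # (B, 1, D)
        return torch.cat([selected, context], dim=1)        # (B, top_k + 1, D)


if __name__ == "__main__":
    aligner = MultiLevelAligner()
    feats = torch.randn(2, 256, 768)
    print(aligner(feats).shape)  # torch.Size([2, 65, 768])
```

The idea sketched here is simply to keep a small set of high-attention visual tokens (fine detail) while compressing the rest into a single context token (global semantics); the paper's actual selector and merger may differ.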