Jieyi Wang
2026
SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation
Sirry Chen | Jieyi Wang | Wei Chen | Zhongyu Wei
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Medical consultations are intrinsically speech-centric. However, most prior works focus on long-text-based interactions, which are cumbersome and patient-unfriendly. Recent advances in speech language models (SpeechLMs) have enabled more natural speech-based interaction, yet the scarcity of medical speech data and the inefficiency of directly fine-tuning on speech data jointly hinder the adoption of SpeechLMs in medical consultation. In this paper, we propose SpeechMedAssist, a SpeechLM natively capable of conducting speech-based multi-turn interactions with patients. By exploiting the architectural properties of SpeechLMs, we decouple the conventional one-stage training into a two-stage paradigm consisting of **(1) Knowledge Capability Injection via Text** and **(2) Modality Re-alignment with Limited Speech Data**, thereby reducing the requirement for medical speech data to only **10k** synthesized samples. To evaluate SpeechLMs for medical consultation scenarios, we design a benchmark comprising both single-turn question answering and multi-turn simulated interactions. Experimental results show that our model outperforms all baselines in both effectiveness and robustness in most evaluation settings.
2024
MLeVLM: Improve Multi-level Progressive Capabilities based on Multimodal Large Language Model for Medical Visual Question Answering
Dexuan Xu | Yanyuan Chen | Jieyi Wang | Yue Huang | Hanpin Wang | Zhi Jin | Hongxing Wang | Weihua Yue | Jing He | Hang Li | Yu Huang
Findings of the Association for Computational Linguistics: ACL 2024
Medical visual question answering (MVQA) requires in-depth understanding of medical images and questions to provide reliable answers. We summarize multi-level progressive capabilities that models need to focus on in MVQA: recognition, details, diagnosis, knowledge, and reasoning. Existing MVQA models tend to ignore the above capabilities due to unspecific data and plain architectures. To address these issues, this paper proposes the Multi-level Visual Language Model (MLeVLM) for MVQA. On the data side, we construct a high-quality multi-level instruction dataset MLe-VQA via GPT-4, which covers multi-level questions and answers as well as reasoning processes from visual clues to semantic cognition. On the architecture side, we propose a multi-level feature alignment module, including an attention-based token selector and a context merger, which can efficiently align features at different levels from visual to semantic. To better evaluate the model’s capabilities, we manually construct a multi-level MVQA evaluation benchmark named MLe-Bench. Extensive experiments demonstrate the effectiveness of our constructed multi-level instruction dataset and the multi-level feature alignment module. They also show that MLeVLM outperforms existing medical multimodal large language models.