Jinghan Sun
2026
Multimodal Dual-Path Decoding for Medical Report Generation
Jinghan Sun | Dong Wei | Zhihong Zhu | Yuyang Xue | Steven McDonagh | Xian Wu
Findings of the Association for Computational Linguistics: ACL 2026
Radiology report generation requires precise alignment between medical imaging findings and clinically coherent textual descriptions. While current methods predominantly rely on either large vision-language models (LVLMs) for visual grounding or large language models (LLMs) for medical narrative generation, they often fail to effectively integrate multimodal clinical evidence with domain-specific knowledge. This paper proposes a novel multimodal dual-path framework that synergistically combines LVLMs and LLMs to address these limitations. Our approach establishes a dynamic fusion between LVLMs’ visual-semantic grounding capabilities and LLMs’ clinical knowledge reasoning. Specifically, we employ a structured prompting strategy that decomposes the report generation task into three clinically meaningful sections and introduces fine-grained multi-label classification prompts to guide the models, enabling more accurate and comprehensive clinical report generation. Experiments on the public MIMIC-CXR benchmark demonstrate our framework’s superiority over state-of-the-art methods.
2025
A Survey on Multi-modal Intent Recognition: Recent Advances and New Frontiers
Zhihong Zhu | Fan Zhang | Yunyan Zhang | Jinghan Sun | Zhiqi Huang | Qingqing Long | Bowen Xing | Xian Wu
Findings of the Association for Computational Linguistics: EMNLP 2025
Multi-modal intent recognition (MIR) requires integrating non-verbal cues from real-world contexts to enhance human intention understanding, and it has attracted substantial research attention in recent years. Despite promising advancements, a comprehensive survey summarizing recent advances and new frontiers remains absent. To this end, we present a thorough and unified review of MIR with four contributions: (1) Extensive survey: we take the first step to present a thorough survey of this research field covering textual, visual (image/video), and acoustic signals. (2) Unified taxonomy: we provide a unified framework including evaluation protocols and advanced methods to summarize the current progress in MIR. (3) Emerging frontiers: we discuss some future directions such as multi-task, multi-domain, and multi-lingual MIR, and give our thoughts on each. (4) Abundant resources: we collect abundant open-source resources, including relevant papers, data corpora, and leaderboards. We hope this survey can shed light on future research in MIR.