This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
YangyangWu
Fixing paper assignments
Please select all papers that do not belong to this person.
Indicate below which author they should be assigned to.
Medical Vision-Language Models (Med-VLMs) have achieved success across various tasks, yet most existing methods overlook the modality misalignment issue that can lead to untrustworthy responses in clinical settings. In this paper, we propose Hierarchical Self-Contrastive Rewarding (HSCR), a novel approach that addresses two critical challenges in Med-VLM alignment: 1) Cost-effective generation of high-quality preference data; 2) Capturing nuanced and context-aware preferences for improved alignment. HSCR first leverages the inherent capability of Med-VLMs to generate dispreferred responses with higher sampling probability. By analyzing output logit shifts after visual token dropout, we identify modality-coupled tokens that induce misalignment and derive an implicit alignment reward function. This function guides token replacement with hallucinated ones during decoding, producing high-quality dispreferred data. Furthermore, HSCR introduces a multi-level preference optimization strategy, which extends beyond traditional adjacent-level optimization by incorporating nuanced implicit preferences, leveraging relative quality in dispreferred data to capture subtle alignment cues for more precise and context-aware optimization. Extensive experiments across multiple medical tasks, including Med-VQA, medical image captioning and instruction following, demonstrate that HSCR not only enhances zero-shot performance but also significantly improves modality alignment and trustworthiness with just 2,000 training entries. Code is released on https://github.com/jiangsongtao/HSCR.
The widespread presence of incomplete modalities in multimodal data poses a significant challenge to achieving accurate rumor detection. Existing multimodal rumor detection methods primarily focus on learning joint modality representations from complete multimodal training data, rendering them ineffective in addressing the common occurrence of missing modalities in real-world scenarios. In this paper, we propose a hierarchical soft prompt model TriSPrompt, which integrates three types of prompts, i.e., modality-aware (MA) prompt, modality-missing (MM) prompt, and mutual-views (MV) prompt, to effectively detect rumors in incomplete multimodal data. The MA prompt captures both heterogeneous information from specific modalities and homogeneous features from available data, aiding in modality recovery. The MM prompt models missing states in incomplete data, enhancing the model’s adaptability to missing information. The MV prompt learns relationships between subjective (i.e., text and image) and objective (i.e., comments) perspectives, effectively detecting rumors. Extensive experiments on three real-world benchmarks demonstrate that TriSPrompt achieves an accuracy gain of over 13% compared to state-of-the-art methods. The codes and datasets are available at https: //anonymous.4open.science/r/code-3E88.
Missing modality issues are common in real-world applications, arising from factors such as equipment failures and privacy concerns. When fine-tuning pre-trained models on downstream datasets with missing modalities, performance can degrade significantly. Current methods often aggregate various missing cases to train recovery modules or align multimodal features, resulting in suboptimal performance, high computational costs, and the risk of catastrophic forgetting in continual environments where data arrives sequentially. In this paper, we formulate the dynamic missing modality problem as a continual learning task and introduce the continual multimodal missing modality task. To address this challenge efficiently, we introduce three types of prompts: modality-specific, task-aware, and task-specific prompts. These prompts enable the model to learn intra-modality, inter-modality, intra-task, and inter-task features. Furthermore, we propose a contrastive task interaction strategy to explicitly learn prompts correlating different modalities. We conduct extensive experiments on three public datasets, where our method consistently outperforms state-of-the-art approaches.