Wenyu Guo


2026

With the widespread proliferation of the Internet, the spread of fake news has accelerated significantly, evolving from single-text content to multimodal forms that include images and videos. The task of Multimodal Fake News Detection (MFND) takes both text and relevant images as input for fake news identification. However, issues such as image noise and inaccurate focus of visual features often lead to insufficient attention to critical information within images during multimodal fusion. To effectively address these challenges, we propose a covariance matrix-driven image channel allocation method. This method first expands the number of original channel maps, then evaluates the importance of image channels through the covariance matrix and assigns importance scores to the expanded channel maps, thereby redirecting the focus of visual features. Subsequently, we design a multimodal fusion strategy based on a multilayer co-attention mechanism to achieve dynamic fusion across modalities. Finally, a contrastive learning loss is introduced to enhance the alignment between textual and visual modalities. Extensive experiments demonstrate that our method achieves state-of-the-art performance on three public multimodal fake news detection benchmark datasets.
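The covariance-driven channel scoring described in the abstract above can be illustrated with a minimal sketch. The abstract does not give the scoring rule, so everything here is an assumption: importance is taken as a channel's mean absolute covariance with all channels, normalized by a softmax, and the names `channel_covariance`, `channel_scores`, and `reweight` are hypothetical. Channel maps are represented as flattened lists of floats, and the channel expansion step is omitted.

```python
import math

def channel_covariance(channels):
    """Sample covariance matrix between flattened channel maps."""
    n = len(channels[0])
    means = [sum(ch) / n for ch in channels]
    centered = [[v - m for v in ch] for ch, m in zip(channels, means)]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
        for ci in centered
    ]

def channel_scores(cov):
    """ASSUMPTION: importance = softmax of mean absolute covariance per row."""
    raw = [sum(abs(v) for v in row) / len(row) for row in cov]
    peak = max(raw)  # subtract the max for numerical stability
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]

def reweight(channels, scores):
    """Scale every channel map by its importance score."""
    return [[s * v for v in ch] for ch, s in zip(channels, scores)]

# Toy input: three channels of four pixels each; the last one is constant.
channels = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [5.0, 5.0, 5.0, 5.0]]
cov = channel_covariance(channels)
scores = channel_scores(cov)
weighted = reweight(channels, scores)
```

Under this assumed rule, a constant channel has zero covariance with every channel and therefore receives the lowest score, while the channel with the largest variance receives the highest.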
Multimodal content combining textual and visual information poses significant challenges for rumor detection on social media. Compared to traditional spatial-domain features, frequency-domain features have attracted increasing attention due to their stronger discriminative capabilities. However, existing methods still fall short in capturing cross-modal semantic inconsistencies and often overlook inherent noise in multimodal features, which limits overall detection performance. To address these issues, we propose a novel multimodal rumor detection method based on multi-scale spectral selection and entropy-guided uncertainty fusion. Specifically, we first apply the Discrete Cosine Transform (DCT) to image and text features to convert them into the frequency domain. Then, multi-scale convolutional filters are employed to extract fine-grained information across different frequency scales. Next, modality separation is performed to capture both shared and modality-specific features, enabling more effective cross-modal representation learning. Finally, entropy is used to estimate the uncertainty of each prediction branch, from which confidence scores are computed and used for adaptive weighted fusion. Experimental results on multiple benchmark datasets show that our method outperforms existing state-of-the-art approaches in multimodal rumor detection, with stronger detection capability and robustness.
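The final step of the abstract above, entropy-guided fusion of prediction branches, can be sketched as follows. The abstract does not state how entropy is mapped to a confidence score, so this sketch assumes confidence equals one minus the normalized Shannon entropy and that the fusion weights are the confidences rescaled to sum to one. The function names and the `eps` smoothing term are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def confidence(probs):
    """ASSUMPTION: confidence = 1 - entropy / maximum possible entropy."""
    return max(0.0, 1.0 - entropy(probs) / math.log(len(probs)))

def fuse(branches, eps=1e-8):
    """Weight each branch's prediction by its normalized confidence."""
    confs = [confidence(p) for p in branches]
    total = sum(confs) + eps * len(confs)  # eps avoids division by zero
    weights = [(c + eps) / total for c in confs]
    num_classes = len(branches[0])
    fused = [
        sum(w * p[i] for w, p in zip(weights, branches))
        for i in range(num_classes)
    ]
    return fused, weights

# Toy input: one confident branch and one maximally uncertain branch.
branches = [[0.9, 0.1], [0.5, 0.5]]
fused, weights = fuse(branches)
```

With this choice, a branch that predicts a uniform distribution has zero confidence and contributes almost nothing, so the fused prediction stays close to the confident branch.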

2024

The objective of the Chinese Vision-Language Understanding Evaluation (CVLUE) is to comprehensively assess the performance of Chinese vision-language multimodal pre-trained models in multimodal modeling and understanding across four tasks: Image-Text Retrieval, Visual Question Answering, Visual Grounding, and Visual Dialog. To enhance the models' performance across various multimodal tasks, this paper proposes a multimodal information understanding enhancement method based on answer-guided images. Firstly, we propose task-specific methods for answer-guided image generation. Secondly, the authentic and answer-guided images are fed into the model for multimodal fine-tuning, respectively. Finally, training objectives are set for different tasks to minimize the gap between the answer-guided images and authentic images, thereby using the answer-guided images to supervise the results produced from the authentic images. The experimental results demonstrate the effectiveness of the proposed method.

2023

Multimodal machine translation (MMT) simultaneously takes the source sentence and a relevant image as input for translation. Since there is no paired image available for the input sentence in most cases, recent studies suggest utilizing powerful text-to-image generation models to provide image inputs. Nevertheless, synthetic images generated by these models often follow different distributions compared to authentic images. Consequently, using authentic images for training and synthetic images for inference can introduce a distribution shift, resulting in performance degradation during inference. To tackle this challenge, in this paper, we feed synthetic and authentic images to the MMT model, respectively. Then we minimize the gap between the synthetic and authentic images by drawing close the input image representations of the Transformer Encoder and the output distributions of the Transformer Decoder. Therefore, we mitigate the distribution disparity introduced by the synthetic images during inference, thereby freeing the authentic images from the inference process. Experimental results show that our approach achieves state-of-the-art performance on the Multi30K En-De and En-Fr datasets, while remaining independent of authentic images during inference.
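The idea of drawing the synthetic-image and authentic-image paths together can be sketched as a consistency loss. The abstract does not specify the distance measures, so this sketch assumes a mean squared error between the two image representations and a KL divergence between the two decoder output distributions. The function names and the weights `alpha` and `beta` are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mse(a, b):
    """Mean squared error between two representation vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def consistency_loss(rep_syn, rep_auth, logits_syn, logits_auth,
                     alpha=1.0, beta=1.0):
    """ASSUMPTION: representation gap (MSE) plus output gap (KL)."""
    rep_gap = mse(rep_syn, rep_auth)
    out_gap = kl_divergence(softmax(logits_auth), softmax(logits_syn))
    return alpha * rep_gap + beta * out_gap

# Toy input: identical paths give zero loss, diverging paths a positive loss.
same = consistency_loss([0.2, 0.4], [0.2, 0.4], [1.0, 2.0], [1.0, 2.0])
diff = consistency_loss([0.2, 0.4], [0.5, 0.1], [1.0, 2.0], [2.0, 1.0])
```

In training, a term of this kind would be added to the usual translation loss, so that at inference time the model can rely on synthetic images alone.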