Shiqin Han

2026

Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective
Sijie Mai | Shiqin Han
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multimodal affective computing aims to predict humans’ sentiment, emotion, intention, and opinion using language, acoustic, and visual modalities. However, current models often learn spurious correlations that harm generalization under distribution shifts or noisy modalities. To address this, we propose a causal modality-invariant representation (CmIR) learning framework for robust multimodal learning. At its core, we introduce a theoretically grounded disentanglement method that separates each modality into ‘causal invariant representation’ and ‘environment-specific spurious representation’ from a causal inference perspective. CmIR ensures that the learned invariant representations retain stable predictive relationships with labels across different environments while preserving sufficient information from the raw inputs via invariance constraint, mutual information constraint, and reconstruction constraint. Experiments across multiple multimodal benchmarks demonstrate that CmIR achieves state-of-the-art performance. CmIR particularly excels on out-of-distribution data and noisy data, confirming its robustness and generalizability.

2025

pdf bib abs

Supervised Attention Mechanism for Low-quality Multimodal Data
Sijie Mai | Shiqin Han | Haifeng Hu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

In practical applications, multimodal data are often of low quality, with noisy modalities and missing modalities being typical forms that severely hinder model performance, robustness, and applicability. However, current studies address these issues separately. To this end, we propose a framework for multimodal affective computing that jointly addresses missing and noisy modalities to enhance model robustness in low-quality data scenarios. Specifically, we view missing modality as a special case of noisy modality, and propose a supervised attention framework. In contrast to traditional attention mechanisms that rely on main task loss to update the parameters, we design supervisory signals for the learning of attention weights, ensuring that attention mechanisms can focus on discriminative information and suppress noisy information. We further propose a ranking-based optimization strategy to compare the relative importance of different interactions by adding a ranking constraint for attention weights, avoiding training noise caused by inaccurate absolute labels. The proposed model consistently outperforms state-of-the-art baselines on multiple datasets under the settings of complete modalities, missing modalities, and noisy modalities.

Co-authors

Sijie Mai 2
Haifeng Hu 1

Venues

ACL1
EMNLP1

Fix author