Jiecheng Zhang


2025

pdf bib
Incongruity-aware Tension Field Network for Multi-modal Sarcasm Detection
Jiecheng Zhang | C.L.Philip Chen | Shuzhen Li | Tong Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multi-modal sarcasm detection (MSD) identifies sarcasm and accurately understands users’ real attitudes from text-image pairs. Most MSD researches explore the incongruity of text-image pairs as sarcasm information through consistency preference methods. However, these methods prioritize consistency over incongruity and blur incongruity information under their global feature aggregation mechanisms, leading to incongruity distortions and model misinterpretations. To address the above issues, this paper proposes a pioneering inconsistency preference method called incongruity-aware tension field network (ITFNet) for multi-modal sarcasm detection tasks. Specifically, ITFNet extracts effective text-image feature pairs in fact and sentiment perspectives. It then constructs a fact/sentiment tension field with discrepancy metrics to capture the contextual tone and polarized incongruity after the iterative learning of tension intensity, effectively highlighting incongruity information during such inconsistency preference learning. It further standardizes the polarized incongruity with reference to contextual tone to obtain standardized incongruity, effectively implementing instance standardization for unbiased decision-making in MSD. ITFNet performs well in extracting salient and standardized incongruity through an incongruity-aware tension field, significantly tackling incongruity distortions and cross-instance variance. Moreover, ITFNet achieves state-of-the-art performance surpassing LLaVA1.5-7B with only 17.3M trainable parameters, demonstrating its optimal performance-efficiency in multi-modal sarcasm detection tasks.