Xiaowei Wang


2026

Emotion Recognition in Conversation (ERC) typically focuses on identifying static emotional states, overlooking the cognitive mechanisms that drive emotional transitions. This work introduces a novel emotion prediction task grounded in Appraisal Theory, which conceptualizes emotion as a cognitive evaluation of expectations and their violations. To address this task, we develop a prompt-based reasoning framework that decomposes emotional dynamics into three interpretable stages, namely expectation inference, violation detection, and emotion-shift prediction, thereby explaining not only which emotion is expressed but also why it emerges. To examine whether LLMs exhibit human-like affective reasoning, we design six appraisal-informed prompting tasks and evaluate eight representative LLMs across four conversational corpora. A unified two-level evaluation, which measures both emotion classification and transition dynamics, reveals that explicit expectation cues improve accuracy by up to +2.4%, whereas violation-only cues often degrade performance. Our analysis uncovers a robust appraisal pattern across models and datasets: expectation construction is the primary contributor to accurate emotion prediction, while isolated violation cues tend to induce misattribution rather than improve causal reasoning. Beyond label accuracy, transition-level evaluation shows that LLMs capture emotion-shift direction above chance but exhibit a marked stability bias, over-predicting no-change trajectories and under-detecting fine-grained shifts. These findings demonstrate both the promise and the current limits of LLMs in appraisal-driven affective reasoning, and motivate a new cognitively grounded research direction.
Narrative interpretation is an essential aspect of human cognition, enabling individuals to comprehend complex sequences of events, form emotional connections, and engage in nuanced social reasoning. At the heart of this interpretive ability lies emotional understanding, which cognitive scientists often frame through Appraisal Theory, a model that views emotions as the outcome of subjective evaluations of events in relation to goals, values, and beliefs. In this study, we explore whether multimodal large language models (MLLMs) can replicate aspects of this human-like narrative and emotional reasoning. Specifically, we examine how well MLLMs interpret visual narratives, focusing on their ability to identify and appraise emotional content within scenes. We also investigate whether these models can use their own self-generated narrative descriptions to enhance their emotion recognition, as humans often do. To probe these questions, we conducted a series of experiments on two publicly available datasets, EMOTIC and HECO. Contrary to our expectations, the results reveal a consistent and noteworthy pattern: rather than improving performance, the inclusion of supplementary narrative or contextual information frequently diminishes the models' ability to recognize emotions accurately. This counterintuitive finding suggests that current MLLMs face significant challenges in integrating multimodal information in a coherent, context-sensitive way. These findings underscore key limitations in the emotional and narrative reasoning capabilities of existing MLLMs and highlight a critical gap between human cognitive processes and current AI approaches.