Yanyan Zhao

2022

Multimodal sentiment analysis has attracted increasing attention and lots of models have been proposed. However, the performance of the state-of-the-art models decreases sharply when they are deployed in the real world. We find that the main reason is that real-world applications can only access the text outputs by the automatic speech recognition (ASR) models, which may be with errors because of the limitation of model capacity. Through further analysis of the ASR outputs, we find that in some cases the sentiment words, the key sentiment elements in the textual modality, are recognized as other words, which makes the sentiment of the text change and hurts the performance of multimodal sentiment analysis models directly. To address this problem, we propose the sentiment word aware multimodal refinement model (SWRM), which can dynamically refine the erroneous sentiment words by leveraging multimodal sentiment clues. Specifically, we first use the sentiment word position detection module to obtain the most possible position of the sentiment word in the text and then utilize the multimodal sentiment word refinement module to dynamically refine the sentiment word embeddings. The refined embeddings are taken as the textual inputs of the multimodal feature fusion module to predict the sentiment labels. We conduct extensive experiments on the real-world datasets including MOSI-Speechbrain, MOSI-IBM, and MOSI-iFlytek and the results demonstrate the effectiveness of our model, which surpasses the current state-of-the-art models on three datasets. Furthermore, our approach can be adapted for other multimodal feature fusion models easily.

pdf abs
Face-Sensitive Image-to-Emotional-Text Cross-modal Translation for Multimodal Aspect-based Sentiment Analysis
Hao Yang | Yanyan Zhao | Bing Qin
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Aspect-level multimodal sentiment analysis, which aims to identify the sentiment of the target aspect from multimodal data, recently has attracted extensive attention in the community of multimedia and natural language processing. Despite the recent success in textual aspect-based sentiment analysis, existing models mainly focused on utilizing the object-level semantic information in the image but ignore explicitly using the visual emotional cues, especially the facial emotions. How to distill visual emotional cues and align them with the textual content remains a key challenge to solve the problem. In this work, we introduce a face-sensitive image-to-emotional-text translation (FITE) method, which focuses on capturing visual sentiment cues through facial expressions and selectively matching and fusing with the target aspect in textual modality. To the best of our knowledge, we are the first that explicitly utilize the emotional information from images in the multimodal aspect-based sentiment analysis task. Experiment results show that our method achieves state-of-the-art results on the Twitter-2015 and Twitter-2017 datasets. The improvement demonstrates the superiority of our model in capturing aspect-level sentiment in multimodal data with facial expressions.

pdf abs
SSR: Utilizing Simplified Stance Reasoning Process for Robust Stance Detection
Jianhua Yuan | Yanyan Zhao | Yanyue Lu | Bing Qin
Proceedings of the 29th International Conference on Computational Linguistics

Dataset bias in stance detection tasks allows models to achieve superior performance without using targets. Most existing debiasing methods are task-agnostic, which fail to utilize task knowledge to better discriminate between genuine and bias features. Motivated by how humans tackle stance detection tasks, we propose to incorporate the stance reasoning process as task knowledge to assist in learning genuine features and reducing reliance on bias features. The full stance reasoning process usually involves identifying the span of the mentioned target and corresponding opinion expressions, such fine-grained annotations are hard and expensive to obtain. To alleviate this, we simplify the stance reasoning process to relax the granularity of annotations from token-level to sentence-level, where labels for sub-tasks can be easily inferred from existing resources. We further implement those sub-tasks by maximizing mutual information between the texts and the opinioned targets. To evaluate whether stance detection models truly understand the task from various aspects, we collect and construct a series of new test sets. Our proposed model achieves better performance than previous task-agnostic debiasing methods on most of those new test sets while maintaining comparable performances to existing stance detection models.

pdf abs
MuCDN: Mutual Conversational Detachment Network for Emotion Recognition in Multi-Party Conversations
Weixiang Zhao | Yanyan Zhao | Bing Qin
Proceedings of the 29th International Conference on Computational Linguistics

As an emerging research topic in natural language processing community, emotion recognition in multi-party conversations has attained increasing interest. Previous approaches that focus either on dyadic or multi-party scenarios exert much effort to cope with the challenge of emotional dynamics and achieve appealing results. However, since emotional interactions among speakers are often more complicated within the entangled multi-party conversations, these works are limited in capturing effective emotional clues in conversational context. In this work, we propose Mutual Conversational Detachment Network (MuCDN) to clearly and effectively understand the conversational context by separating conversations into detached threads. Specifically, two detachment ways are devised to perform context and speaker-specific modeling within detached threads and they are bridged through a mutual module. Experimental results on two datasets show that our model achieves better performance over the baseline models.

2021

pdf
A Text-Centered Shared-Private Framework via Cross-Modal Prediction for Multimodal Sentiment Analysis
Yang Wu | Zijie Lin | Yanyan Zhao | Bing Qin | Li-Nan Zhu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf abs
Retrieve, Discriminate and Rewrite: A Simple and Effective Framework for Obtaining Affective Response in Retrieval-Based Chatbots
Xin Lu | Yijian Tian | Yanyan Zhao | Bing Qin
Findings of the Association for Computational Linguistics: EMNLP 2021

Obtaining affective response is a key step in building empathetic dialogue systems. This task has been studied a lot in generation-based chatbots, but the related research in retrieval-based chatbots is still in the early stage. Existing works in retrieval-based chatbots are based on Retrieve-and-Rerank framework, which have a common problem of satisfying affect label at the expense of response quality. To address this problem, we propose a simple and effective Retrieve-Discriminate-Rewrite framework. The framework replaces the reranking mechanism with a new discriminate-and-rewrite mechanism, which predicts the affect label of the retrieved high-quality response via discrimination module and further rewrites the affect unsatisfied response via rewriting module. This can not only guarantee the quality of the response, but also satisfy the given affect label. In addition, another challenge for this line of research is the lack of an off-the-shelf affective response dataset. To address this problem and test our proposed framework, we annotate a Sentimental Douban Conversation Corpus based on the original Douban Conversation Corpus. Experimental results show that our proposed framework is effective and outperforms competitive baselines.

2020

Emotion recognition in conversations (ERC) has received much attention recently in the natural language processing community. Considering that the emotions of the utterances in conversations are interactive, previous works usually implicitly model the emotion interaction between utterances by modeling dialogue context, but the misleading emotion information from context often interferes with the emotion interaction. We noticed that the gold emotion labels of the context utterances can provide explicit and accurate emotion interaction, but it is impossible to input gold labels at inference time. To address this problem, we propose an iterative emotion interaction network, which uses iteratively predicted emotion labels instead of gold emotion labels to explicitly model the emotion interaction. This approach solves the above problem, and can effectively retain the performance advantages of explicit modeling. We conduct experiments on two datasets, and our approach achieves state-of-the-art performance.