Fuqing Zhu
Hateful memes detection is a challenging multimodal understanding task that requires comprehensive learning of vision, language, and cross-modal interactions. Previous research has focused on developing effective fusion strategies for integrating hate information from different modalities. However, these methods rely excessively on cross-modal fusion features, ignoring both the modality uncertainty caused by the contribution degree of each modality to hate sentiment and the modality imbalance caused by the dominant modality suppressing the optimization of the other modality. To this end, this paper proposes an Uncertainty-guided Modal Rebalance (UMR) framework for hateful memes detection. The uncertainty of each meme is explicitly formulated by designing a stochastic representation drawn from a Gaussian distribution, which adaptively aggregates cross-modal features with unimodal features. The modality imbalance is alleviated by improving the cosine loss from the perspectives of inter-modal feature constraints and weight vector constraints. In this way, the suppressed unimodal representation ability in multimodal models is unleashed, while the learning of modality contribution is further promoted. Extensive experimental results demonstrate that the proposed UMR achieves state-of-the-art performance on four widely used datasets.
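The idea of aggregating features through a Gaussian-distributed stochastic representation can be illustrated with a minimal sketch. It is not the UMR implementation: the scalar gate, the sigmoid squashing, and the names `sample_gate`, `aggregate`, `mu`, and `log_var` are all assumptions made for illustration, and in a real model the mean and log-variance would be predicted by a network rather than passed in by hand.

```python
import math
import random


def sample_gate(mu, log_var, rng):
    """Reparameterized Gaussian draw z = mu + sigma * eps, squashed to (0, 1).

    Assumption: a single scalar gate stands in for the stochastic representation.
    """
    eps = rng.gauss(0.0, 1.0)
    z = mu + math.exp(0.5 * log_var) * eps
    return 1.0 / (1.0 + math.exp(-z))


def aggregate(cross_modal, unimodal, mu, log_var, rng):
    """Mix cross-modal and unimodal feature vectors with the sampled gate."""
    gate = sample_gate(mu, log_var, rng)
    return [gate * c + (1.0 - gate) * u for c, u in zip(cross_modal, unimodal)]


rng = random.Random(0)
cross_modal = [1.0, 3.0]
unimodal = [3.0, 1.0]
# A very negative log-variance means low uncertainty: the gate stays near sigmoid(mu).
confident = aggregate(cross_modal, unimodal, mu=0.0, log_var=-60.0, rng=rng)
# A larger log-variance lets the mixing weight vary from draw to draw.
uncertain = aggregate(cross_modal, unimodal, mu=0.0, log_var=2.0, rng=rng)
```

With low variance the mix is close to an even blend of the two vectors; with high variance each draw can lean toward either source, which is the sense in which the sampled representation carries uncertainty in this toy version.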
Cold-start is a significant problem in recommender systems. Recently, with the development of few-shot learning and meta-learning techniques, many researchers have applied meta-learning to recommendation as a natural few-shot scenario. Nevertheless, we argue that a large gap remains between few-shot learning and recommendation in recent work. In particular, users in recommendation are locally dependent, not globally independent. Therefore, it is necessary to formulate the local relationships between users. To accomplish this, we present a novel Few-shot learning method for Cold-Start (FCS) recommendation that consists of three hierarchical structures. More concretely, the first hierarchy is the global-meta parameters for learning the global information of all users; the second hierarchy is the local-meta parameters, whose goal is to learn the adaptive cluster of local users; the third hierarchy is the specific parameters of the target user. Both the global and local information are formulated, so that the new-user problem can be addressed rapidly from the few-shot records. Experimental results on two public real-world datasets show that the FCS method produces stable improvements over the state of the art.
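The three levels described above (global, local cluster, target user) can be sketched with a toy scalar model. This is only an illustration under strong assumptions: the single-parameter rating predictor, the additive cluster offset, the gradient-step adaptation, and the name `adapt_user` are hypothetical and are not taken from the FCS paper.

```python
def adapt_user(global_param, cluster_offset, ratings, lr=0.25, steps=3):
    """Adapt a scalar rating predictor to one new user.

    Level 1: global_param, shared by all users.
    Level 2: cluster_offset, shared by the user's local cluster.
    Level 3: theta, the user-specific parameter fitted on a few ratings.
    """
    theta = global_param + cluster_offset  # initialize from the two shared levels
    for _ in range(steps):
        # Gradient of the mean squared error between theta and the observed ratings.
        grad = sum(2.0 * (theta - r) for r in ratings) / len(ratings)
        theta -= lr * grad
    return theta


# A new user with only two observed ratings.
few_shot_ratings = [2.0, 2.0]
user_param = adapt_user(global_param=3.5, cluster_offset=0.5, ratings=few_shot_ratings)
```

Starting from the shared initialization, a handful of gradient steps moves the user-specific parameter toward the user's own ratings, which is the general pattern of meta-learned initialization followed by fast adaptation.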
Hate speech detection has become an urgent task with the emergence of large volumes of multimodal harmful content (e.g., memes) on social media platforms. Previous studies mainly focus on complex feature extraction and fusion to learn discriminative information from memes. However, these methods ignore two key points: 1) the misalignment of image and text in memes caused by the modality gap, and 2) the uncertainty between modalities caused by the contribution degree of each modality to hate sentiment. To this end, this paper proposes an uncertainty-aware cross-modal alignment (UCA) framework for modeling the misalignment and uncertainty in multimodal hate speech detection. Specifically, we first utilize a cross-modal feature encoder to capture image and text feature representations in memes. Then, a cross-modal alignment module is applied to reduce semantic gaps between modalities by aligning the feature representations. Next, a cross-modal fusion module is designed to learn semantic interactions between modalities and capture cross-modal correlations, providing complementary features for memes. Finally, a cross-modal uncertainty learning module is proposed, which evaluates the divergence between unimodal feature distributions to balance unimodal and cross-modal fusion features. Extensive experiments on five publicly available datasets show that the proposed UCA achieves competitive performance compared with existing multimodal hate speech detection methods.
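One way to picture the final step, using a divergence between unimodal feature distributions to balance unimodal and fused features, is the sketch below. The abstract does not say which divergence is used or in which direction it shifts the balance, so the Jensen-Shannon divergence, the softmax normalization, the rule that higher divergence favors unimodal features, and the function names are all assumptions.

```python
import math


def softmax(xs):
    """Turn a feature vector into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, bounded by ln(2) with the natural logarithm."""
    mid = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def balance(image_feat, text_feat, fused_feat):
    """Weight unimodal against fused features by the normalized divergence."""
    w = js_divergence(softmax(image_feat), softmax(text_feat)) / math.log(2.0)
    unimodal = [(i + t) / 2.0 for i, t in zip(image_feat, text_feat)]
    mixed = [w * u + (1.0 - w) * f for u, f in zip(unimodal, fused_feat)]
    return w, mixed


# Modalities that agree give zero divergence, so the fused features pass through.
w_same, out_same = balance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
# Modalities that disagree shift weight toward the unimodal features.
w_diff, out_diff = balance([5.0, 0.0, 0.0], [0.0, 0.0, 5.0], [0.5, 0.5, 0.5])
```

Because the Jensen-Shannon divergence is bounded, dividing by ln(2) gives a weight in [0, 1] without any extra clipping.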
Multimodal emotion recognition for video, which involves three modalities (i.e., textual, visual, and acoustic), has gained considerable attention in recent years. Because the modalities carry different amounts of emotion-related information, they typically contribute to emotion recognition to varying degrees. More seriously, the emotion expressed by an individual modality may be inconsistent with that of the video as a whole. These challenges are caused by the inherent uncertainty of emotion. Inspired by recent advances of quantum theory in modeling uncertainty, we make an initial attempt to design a quantum-inspired adaptive-priority-learning model (QAP) to address them. Specifically, the quantum state is introduced to model modal features, which allows each modality to retain all emotional tendencies until the final classification. Additionally, we design Q-attention to integrate the three modalities in order, and QAP then learns modal priority adaptively so that modalities can provide different amounts of information based on priority. Experimental results on the IEMOCAP and MOSEI datasets show that QAP establishes new state-of-the-art results.
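For readers unfamiliar with the notation, the statement that a quantum state lets a modality "retain all emotional tendencies" can be read through the standard superposition form below. The symbols are assumptions introduced here for illustration (modality index m, K emotion basis states e_k, amplitudes alpha); the abstract does not give the paper's exact formulation, which may differ.

```latex
\[
|\psi_m\rangle = \sum_{k=1}^{K} \alpha_{m,k}\,|e_k\rangle,
\qquad
\sum_{k=1}^{K} |\alpha_{m,k}|^{2} = 1,
\qquad
P(e_k \mid m) = |\alpha_{m,k}|^{2}.
\]
```

Under this reading, every emotion keeps a nonzero amplitude until a measurement (here, the final classification) is made, rather than being committed to a single label early in the pipeline.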
Script learning aims to predict the subsequent event according to the existing event chain. Recent studies focus on event co-occurrence to solve this problem, but few integrate external event knowledge. We observe that external event knowledge can provide additional information, such as temporal or causal knowledge, for better understanding the event chain and predicting the right subsequent event. In this work, we integrate event knowledge from the ASER (Activities, States, Events and their Relations) knowledge base to help predict the next event. We propose a new approach consisting of a knowledge retrieval stage and a knowledge integration stage. In the knowledge retrieval stage, we select relevant external event knowledge from ASER. In the knowledge integration stage, we propose three methods to integrate external knowledge into our model and infer final answers. Experiments on the widely used Multi-Choice Narrative Cloze (MCNC) task show that our approach achieves state-of-the-art performance compared to other methods.
Multi-turn retrieval-based conversation is an important task for building intelligent dialogue systems. Existing works mainly focus on matching candidate responses with every context utterance on multiple levels of granularity, which ignores the side effect of using excessive context information. Context utterances provide abundant information for extracting more matching features, but they also bring noise signals and unnecessary information. In this paper, we analyze the side effect of using too many context utterances and propose a multi-hop selector network (MSN) to alleviate the problem. Specifically, MSN first utilizes a multi-hop selector to select the relevant utterances as context. Then, the model matches the filtered context with the candidate response and obtains a matching score. Experimental results show that MSN outperforms some state-of-the-art methods on three public multi-turn dialogue datasets.
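The select-then-match pipeline can be sketched with bag-of-words vectors. This is a toy stand-in, not MSN itself: the learned selector is replaced by cosine similarity with a fixed threshold, "multi-hop" is approximated by letting already selected utterances extend the query on each hop, and the names `select_context` and `match_score` are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector as token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_context(utterances, hops=2, threshold=0.3):
    """Keep the latest utterance plus earlier ones similar to the growing query."""
    selected = {len(utterances) - 1}  # the latest utterance is always kept
    for _ in range(hops):
        # The query for this hop merges everything selected so far.
        query = sum((bow(utterances[i]) for i in selected), Counter())
        for i, utt in enumerate(utterances):
            if i not in selected and cosine(bow(utt), query) >= threshold:
                selected.add(i)
    return [utterances[i] for i in sorted(selected)]


def match_score(context, response):
    """Score a candidate response against the filtered context."""
    merged = sum((bow(u) for u in context), Counter())
    return cosine(merged, bow(response))


context = [
    "my laptop will not boot",
    "what pizza toppings do you like",
    "laptop shows black screen at boot",
]
kept = select_context(context)
on_topic = match_score(kept, "try holding the power button to reset the laptop")
off_topic = match_score(kept, "i prefer mushrooms on pizza")
```

In this example the off-topic pizza utterance is filtered out before matching, so it cannot add noise to the score of either candidate response.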