Simin Hong
2025
Adversarial Alignment with Anchor Dragging Drift (A3D2): Multimodal Domain Adaptation with Partially Shifted Modalities
Jun Sun | Xinxin Zhang | Simin Hong | Jian Zhu | Lingfang Zeng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal learning has achieved remarkable success across diverse areas, yet faces the challenge of prohibitively expensive data collection and annotation when adapting models to new environments. In this context, domain adaptation has gained growing popularity as a technique for knowledge transfer, which, however, remains underexplored in multimodal settings compared with unimodal ones. This paper investigates multimodal domain adaptation, focusing on a practical partial-shift scenario where some modalities (referred to as anchors) remain domain-stable, while others (referred to as drifts) undergo a domain shift. We propose a bi-alignment scheme that simultaneously performs drift-drift and anchor-drift matching. The former is achieved through adversarial learning, aligning the representations of the drifts across source and target domains; the latter corresponds to an “anchor dragging drift” strategy, which matches the distributions of the drifts and anchors within the target domain via optimal transport (OT). The overall design principle features Adversarial Alignment with Anchor Dragging Drift, abbreviated as A3D2, for multimodal domain adaptation with partially shifted modalities. Comprehensive empirical results verify the effectiveness of the proposed approach and demonstrate that A3D2 achieves superior performance compared with state-of-the-art approaches. The code is available at: https://github.com/sunjunaimer/A3D2.git.
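To make the bi-alignment scheme concrete, below is a minimal PyTorch-style sketch of the two losses it combines: an adversarial drift-drift loss implemented with a gradient-reversal domain discriminator, and an entropic-OT anchor-drift loss computed within the target domain. All module names, tensor shapes, and the Sinkhorn routine are illustrative assumptions for exposition, not the released A3D2 implementation (see the repository above for the actual code).

    # Minimal sketch of the bi-alignment idea (illustrative, not the authors' code).
    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        """Gradient reversal layer for adversarial domain alignment."""
        @staticmethod
        def forward(ctx, x, lamb):
            ctx.lamb = lamb
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lamb * grad_output, None

    def sinkhorn_ot(cost, n_iters=50, eps=0.1):
        """Entropic OT plan between two empirical distributions with uniform weights."""
        n, m = cost.shape
        mu = torch.full((n,), 1.0 / n, device=cost.device)
        nu = torch.full((m,), 1.0 / m, device=cost.device)
        K = torch.exp(-cost / eps)
        u = torch.ones_like(mu)
        for _ in range(n_iters):
            v = nu / (K.t() @ u + 1e-8)
            u = mu / (K @ v + 1e-8)
        return torch.diag(u) @ K @ torch.diag(v)  # transport plan

    def a3d2_losses(drift_src, drift_tgt, anchor_tgt, discriminator, lamb=1.0):
        # Drift-drift: domain discriminator on drift features via gradient reversal.
        bce = nn.BCEWithLogitsLoss()
        d_src = discriminator(GradReverse.apply(drift_src, lamb))
        d_tgt = discriminator(GradReverse.apply(drift_tgt, lamb))
        adv_loss = bce(d_src, torch.ones_like(d_src)) + bce(d_tgt, torch.zeros_like(d_tgt))
        # Anchor-drift: OT cost between anchor and drift features in the target domain.
        cost = torch.cdist(anchor_tgt, drift_tgt, p=2) ** 2   # pairwise squared distances
        plan = sinkhorn_ot(cost.detach(), eps=0.1)            # fixed plan, differentiable cost
        ot_loss = (plan * cost).sum()                         # the anchor "drags" the drift
        return adv_loss, ot_loss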
2024
Amanda: Adaptively Modality-Balanced Domain Adaptation for Multimodal Emotion Recognition
Xinxin Zhang | Jun Sun | Simin Hong | Taihao Li
Findings of the Association for Computational Linguistics: ACL 2024
This paper investigates unsupervised multimodal domain adaptation for multimodal emotion recognition, which offers a solution to data scarcity yet remains understudied. Due to the varying distribution discrepancies of different modalities between source and target domains, the primary challenge lies in balancing domain alignment across modalities so that all of them are well aligned. To achieve this, we first develop our model based on information bottleneck theory to learn an optimal representation for each modality independently. Then, we align the domains by matching the label distributions and the representations. To balance the representation alignment, we propose to minimize a surrogate of the alignment losses, which is equivalent to adaptively adjusting the weights of the modalities throughout training, thus achieving balanced domain alignment across modalities. Overall, the proposed approach features Adaptively modality-balanced domain adaptation, dubbed Amanda, for multimodal emotion recognition. Extensive empirical results on commonly used benchmark datasets demonstrate that Amanda significantly outperforms competing approaches. The code is available at https://github.com/sunjunaimer/Amanda.
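As a rough illustration of the balancing mechanism, the sketch below minimizes a log-sum-exp surrogate over per-modality alignment losses; its gradient weights each modality's loss by a softmax of its current value, so modalities that are currently worse aligned receive larger weights. The function name, the temperature, and this particular surrogate are illustrative assumptions and may differ from Amanda's exact formulation (see the linked repository).

    # Illustrative sketch of adaptively weighted per-modality alignment (hypothetical surrogate).
    import torch

    def balanced_alignment_loss(per_modality_losses, tau=1.0):
        """Log-sum-exp surrogate over per-modality alignment losses.

        Its gradient weights each modality by softmax(loss / tau), so modalities
        that are currently worse aligned receive larger weights, adaptively
        balancing domain alignment across modalities during training.
        """
        losses = torch.stack(per_modality_losses)          # shape: (num_modalities,)
        return tau * torch.logsumexp(losses / tau, dim=0)

    # Usage: combine with per-modality task / information-bottleneck objectives, e.g.
    # loss = task_loss + balanced_alignment_loss([align_text, align_audio, align_video])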
DetectiveNN: Imitating Human Emotional Reasoning with a Recall-Detect-Predict Framework for Emotion Recognition in Conversations
Simin Hong | Jun Sun | Taihao Li
Findings of the Association for Computational Linguistics: EMNLP 2024
Emotion Recognition in Conversations (ERC) involves an internal cognitive process that interprets emotional cues by drawing on a collection of past emotional experiences. However, many existing methods struggle to decipher emotional cues in dialogues because they do not adequately capture the rich historical emotional context. In this work, we introduce the Detective Network (DetectiveNN), a novel model grounded in the cognitive theory of emotion that uses a “recall-detect-predict” framework to imitate human emotional reasoning. The process begins by ‘recalling’ past interactions of a specific speaker to collect emotional cues. It then ‘detects’ relevant emotional patterns by interpreting these cues in the context of the ongoing conversation. Finally, it ‘predicts’ the speaker’s current emotional state. Tested on three benchmark datasets, our approach significantly outperforms existing methods, highlighting the benefits of incorporating cognitive factors into deep learning for ERC in both task efficacy and prediction accuracy.
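For intuition about the three stages, here is a schematic PyTorch sketch in which recalled speaker history (recall) is attended to from the current utterance representation (detect) and the fused result is classified (predict). The module names, the attention-based detector, and the tensor shapes are illustrative assumptions, not the authors' architecture.

    # Schematic recall-detect-predict pipeline (illustrative assumptions only).
    import torch
    import torch.nn as nn

    class RecallDetectPredict(nn.Module):
        def __init__(self, dim, num_emotions):
            super().__init__()
            self.detect = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            self.predict = nn.Linear(2 * dim, num_emotions)

        def forward(self, current_utt, speaker_history):
            # Recall: speaker_history holds the speaker's past utterance representations.
            # current_utt: (batch, dim); speaker_history: (batch, hist_len, dim)
            query = current_utt.unsqueeze(1)                     # (batch, 1, dim)
            # Detect: attend from the current utterance to the recalled emotional cues.
            cues, _ = self.detect(query, speaker_history, speaker_history)
            # Predict: classify the current emotion from the utterance plus detected cues.
            fused = torch.cat([current_utt, cues.squeeze(1)], dim=-1)
            return self.predict(fused)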