Min Hu

2025

pdf bib abs
Proxy-Driven Robust Multimodal Sentiment Analysis with Incomplete Data
Aoqiang Zhu | Min Hu | Xiaohua Wang | Jiaoyun Yang | Yiming Tang | Ning An
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multimodal Sentiment Analysis (MSA) with incomplete data has gained significant attention recently. Existing studies focus on optimizing model structures to handle modality missingness, but models still face challenges in robustness when dealing with uncertain missingness. To this end, we propose a data-centric robust multimodal sentiment analysis method, Proxy-Driven Robust Multimodal Fusion (P-RMF). First, we map unimodal data to the latent space of Gaussian distributions to capture core features and structure, thereby learn stable modality representation. Then, we combine the quantified inherent modality uncertainty to learn stable multimodal joint representation (i.e., proxy modality), which is further enhanced through multi-layer dynamic cross-modal injection to increase its diversity. Extensive experimental results show that P-RMF outperforms existing models in noise resistance and achieves state-of-the-art performance on multiple benchmark datasets. Code will be available at https://github.com/***/P-RMF.

Multimodal Aspect-Based Sentiment Analysis (MABSA) aims to extract aspect-sentiment pairs from text and image data. While significant progress has been made in image-aspect alignment, due to the subtlety and complexity of language expressions, there are not always explicit aspect words in the language to align with images. Existing methods typically assume a direct alignment between images and aspects, matching the entire image with a corresponding aspect. This rough alignment of images and aspects introduces noise. To address the above issues, this paper proposes a Dual-Aware Enhanced Alignment Network (DaNet) designed for fine-grained multimodal aspect-image alignment and denoising. Specifically, we first introduce a Multimodal Denoising Encoder (MDE) that jointly image and text to guide the compression and denoising of visual sequences. And then, aspect-aware and sentiment-aware networks are constructed to jointly enhance fine-grained alignment and denoising of text-image information. To better align implicit aspects, an Implicit Aspect Opinion Generation (IAOG) pretraining is designed under the guidance of large language model. Extensive experiments across three MABSA subtasks demonstrate that DaNet outperforms existing methods. Code will be available at https://github.com/***/DaNet.

Multimodal Sentiment Analysis (MSA) integrates diverse modalities to overcome the limitations of unimodal data. However, existing MSA datasets commonly exhibit significant sentiment distribution imbalances and cross-modal sentiment conflicts, which hinder performance improvement. This paper shows that distributional discrepancies and sentiment conflicts can be incorporated into the model training to learn stable multimodal invariant sentiment representation. To this end, we propose a Multimodal Invariant Sentiment Representation Learning (MISR) method. Specifically, we first learn a stable and consistent multimodal joint representation in the latent space of Gaussian distribution based on distributional constraints Then, under invariance constraint, we further learn multimodal invariant sentiment representations from multiple distributional environments constructed by the joint representation and unimodal data, achieving robust and efficient MSA performance. Extensive experiments demonstrate that MISR significantly enhances MSA performance and achieves new state-of-the-art.

2024

pdf bib abs
MTLS: Making Texts into Linguistic Symbols
Wenlong Fei | Xiaohua Wang | Min Hu | Qingyu Zhang | Hongbo Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

In linguistics, all languages can be considered as symbolic systems, with each language relying on symbolic processes to associate specific symbols with meanings. In the same language, there is a fixed correspondence between linguistic symbol and meaning. In different languages, universal meanings follow varying rules of symbolization in one-to-one correspondence with symbols. Most work overlooks the properties of languages as symbol systems. In this paper, we shift the focus to the symbolic properties and introduce MTLS: a pre-training method to improve the multilingual capability of models by Making Texts into Linguistic Symbols. Initially, we replace the vocabulary in pre-trained language models by mapping relations between linguistic symbols and semantics. Subsequently, universal semantics within the symbolic system serve as bridges, linking symbols from different languages to the embedding space of the model, thereby enabling the model to process linguistic symbols. To evaluate the effectiveness of MTLS, we conducted experiments on multilingual tasks using BERT and RoBERTa, respectively, as the backbone. The results indicate that despite having just over 12,000 pieces of English data in pre-training, the improvement that MTLS brings to multilingual capabilities is remarkably significant.

pdf bib abs
MEVTR: A Multilingual Model Enhanced with Visual Text Representations
Xiaohua Wang | Wenlong Fei | Min Hu | Qingyu Zhang | Aoqiang Zhu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The goal of multilingual modelling is to generate multilingual text representations for various downstream tasks in different languages. However, some state-of-the-art pre-trained multilingual models perform poorly on many low-resource languages due to the lack of representation space and model capacity. To alleviate this issue, we propose a Multilingual model Enhanced with Visual Text Representations (MEVTR), which complements textual representations and extends the multilingual representation space with visual text representations. First, the visual encoder focuses on the glyphs and structure of the text to obtain visual text representations, and the textual encoder obtains textual representations. Then, multilingual representations are enhanced by aligning and fusing visual text representations and textual representations. Moreover, we propose similarity constraint, a self-supervised task to prompt the visual encoder to focus on more additional information. Prefix alignment and multi-head bilinear module are designed to acquire an improved integration effect of visual text representations and textual representations. Experimental results indicate that MEVTR benefits from visual text representations and achieves significant performance gains in downstream tasks. In particular, in the zero-shot cross-lingual transfer task, MEVTR achieves results that outperform the state-of-the-art adapter-based framework without the target language adapter.

2020

A Dialogue State Tracker (DST) is a core component of a modular task-oriented dialogue system. Tremendous progress has been made in recent years. However, the major challenges remain. The state-of-the-art accuracy for DST is below 50% for a multi-domain dialogue task. A learnable DST for any new domain requires a large amount of labeled in-domain data and training from scratch. In this paper, we propose a Meta-Reinforced Multi-Domain State Generator (MERET). Our first contribution is to improve the DST accuracy. We enhance a neural model based DST generator with a reward manager, which is built on policy gradient reinforcement learning (RL) to fine-tune the generator. With this change, we are able to improve the joint accuracy of DST from 48.79% to 50.91% on the MultiWOZ corpus. Second, we explore to train a DST meta-learning model with a few domains as source domains and a new domain as target domain. We apply the model-agnostic meta-learning algorithm (MAML) to DST and the obtained meta-learning model is used for new domain adaptation. Our experimental results show this solution is able to outperform the traditional training approach with extremely less training data in target domain.

Open-vocabulary slots, such as file name, album name, or schedule title, significantly degrade the performance of neural-based slot filling models since these slots can take on values from a virtually unlimited set and have no semantic restriction nor a length limit. In this paper, we propose a robust adversarial model-agnostic slot filling method that explicitly decouples local semantics inherent in open-vocabulary slot words from the global context. We aim to depart entangled contextual semantics and focus more on the holistic context at the level of the whole sentence. Experiments on two public datasets show that our method consistently outperforms other methods with a statistically significant margin on all the open-vocabulary slots without deteriorating the performance of normal slots.

pdf bib abs
A Probabilistic End-To-End Task-Oriented Dialog Model with Latent Belief States towards Semi-Supervised Learning
Yichi Zhang | Zhijian Ou | Min Hu | Junlan Feng
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Structured belief states are crucial for user goal tracking and database query in task-oriented dialog systems. However, training belief trackers often requires expensive turn-level annotations of every user utterance. In this paper we aim at alleviating the reliance on belief state labels in building end-to-end dialog systems, by leveraging unlabeled dialog data towards semi-supervised learning. We propose a probabilistic dialog model, called the LAtent BElief State (LABES) model, where belief states are represented as discrete latent variables and jointly modeled with system responses given user inputs. Such latent variable modeling enables us to develop semi-supervised learning under the principled variational learning framework. Furthermore, we introduce LABES-S2S, which is a copy-augmented Seq2Seq model instantiation of LABES. In supervised experiments, LABES-S2S obtains strong results on three benchmark datasets of different scales. In utilizing unlabeled dialog data, semi-supervised LABES-S2S significantly outperforms both supervised-only and semi-supervised baselines. Remarkably, we can reduce the annotation demands to 50% without performance loss on MultiWOZ.

pdf bib abs
A structure-enhanced graph convolutional network for sentiment analysis
Fanyu Meng | Junlan Feng | Danping Yin | Si Chen | Min Hu
Findings of the Association for Computational Linguistics: EMNLP 2020

Syntactic information is essential for both sentiment analysis(SA) and aspect-based sentiment analysis(ABSA). Previous work has already achieved great progress utilizing Graph Convolutional Network(GCN) over dependency tree of a sentence. However, these models do not fully exploit the syntactic information obtained from dependency parsing such as the diversified types of dependency relations. The message passing process of GCN should be distinguished based on these syntactic information. To tackle this problem, we design a novel weighted graph convolutional network(WGCN) which can exploit rich syntactic information based on the feature combination. Furthermore, we utilize BERT instead of Bi-LSTM to generate contextualized representations as inputs for GCN and present an alignment method to keep word-level dependencies consistent with wordpiece unit of BERT. With our proposal, we are able to improve the state-of-the-art on four ABSA tasks out of six and two SA tasks out of three.