Wei Pan


2024

Multi-Granularity Fusion Text Semantic Matching Based on WoBERT
Hongchun Yu | Wei Pan | Xing Fan | Hanqi Li
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Text semantic matching is crucial in natural language processing, with applications in information retrieval, question answering, and recommendation systems. Traditional text-matching methods struggle with semantic nuances in short text. Recent advances in multi-granularity representation learning have increased interest in improving text semantic matching models. We propose a novel multi-granularity fusion model that uses WoBERT, a pre-trained language model, to capture text semantic information more accurately. First, we process each text with WoBERT to obtain semantic representations that capture the semantic nuances of the individual text. Next, we employ a soft attention alignment mechanism that enables multi-granularity fusion among characters, words, and sentences, further improving matching performance. We evaluated our approach on two common Chinese short text matching datasets, BQ and LCQMC. Results show a significant improvement over traditional methods, particularly in accuracy.

2023

Supervised Gradual Machine Learning for Aspect-Term Sentiment Analysis
Yanyan Wang | Qun Chen | Murtadha H.M. Ahmed | Zhaoqiang Chen | Jing Su | Wei Pan | Zhanhuai Li
Transactions of the Association for Computational Linguistics, Volume 11

Recent work has shown that Aspect-Term Sentiment Analysis (ATSA) can be effectively performed by Gradual Machine Learning (GML). However, the performance of the current unsupervised solution is limited by inaccurate and insufficient knowledge conveyance. In this paper, we propose a supervised GML approach for ATSA, which can effectively exploit labeled training data to improve knowledge conveyance. It leverages binary polarity relations between instances, which can be either similar or opposite, to enable supervised knowledge conveyance. In addition to the explicit polarity relations indicated by discourse structures, it separately supervises a polarity classification DNN and a binary Siamese network to extract implicit polarity relations. The proposed approach achieves knowledge conveyance by modeling the detected relations as binary features in a factor graph. Our extensive experiments on real benchmark data show that it achieves state-of-the-art performance across all the test workloads. Our work clearly demonstrates that, in collaboration with DNNs for feature extraction, GML outperforms pure DNN solutions.

2021

融合自编码器和对抗训练的中文新词发现方法 (Finding Chinese New Word By Combining Self-encoder and Adversarial Training)
Wei Pan (潘韦) | Tianyuan Liu (刘天元) | Yuqing Sun (孙宇清) | Bin Gong (龚斌) | Yongman Zhang (张永满) | Ping Yang (杨萍)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

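The continual emergence of new words is a natural law of language. In specialized domains, for example, new concepts and entity names are abstract generalizations of sets of shared features within the domain, and they often play a role in sentences as keywords. New word discovery directly affects Chinese word segmentation results and the performance of downstream text semantic understanding tasks, making it an important task in natural language processing research. This paper proposes a Chinese new word discovery model that combines an autoencoder with adversarial training. The model is pre-trained with a character-level autoencoder in an unsupervised, self-learning manner, which effectively extracts semantic information, is unaffected by word segmentation results, and applies to text from different domains. To introduce general linguistic knowledge, prior syntactic parsing results are added, and a domain-shared encoder fuses semantic and syntactic information to improve the accuracy of segmenting ambiguous words. An adversarial training mechanism is used to extract domain-independent features, reducing reliance on manually annotated corpora. The experiments evaluate new word discovery on datasets from six different specialized domains, and the results show that the proposed model outperforms other existing methods. Ablation experiments verify in detail the effectiveness of each module. Comparative experiments with different types of source-domain data and different amounts of target-domain data verify the robustness of the model. Finally, a visual comparison of how the autoencoder and the shared encoder encode data from different domains shows that the adversarial training method effectively extracts information about both the correlations and the differences between them.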