Yuchen Wang


2023

TeamShakespeare at SemEval-2023 Task 6: Understand Legal Documents with Contextualized Large Language Models
Xin Jin | Yuchen Wang
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The growth of pending legal cases in populous countries, such as India, has become a major issue. Developing effective techniques to process and understand legal documents is extremely useful in resolving this problem. In this paper, we present our systems for SemEval-2023 Task 6: understanding legal texts (Modi et al., 2023). Specifically, we first develop the Legal-BERT-HSLN model that considers the comprehensive context information in both intra- and inter-sentence levels to predict rhetorical roles (subtask A) and then train a Legal-LUKE model, which is legal-contextualized and entity-aware, to recognize legal entities (subtask B). Our evaluations demonstrate that our designed models are more accurate than baselines, e.g., with an up to 15.0% better F1 score in subtask B. We achieved notable performance in the task leaderboard, e.g., 0.834 micro F1 score, and ranked No.5 out of 27 teams in subtask A.

2022

Whodunit? Learning to Contrast for Authorship Attribution
Bo Ai | Yuchen Wang | Yugin Tan | Samson Tan
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Authorship attribution is the task of identifying the author of a given text. The key is finding representations that can differentiate between authors. Existing approaches typically use manually designed features that capture a dataset’s content and style, but these approaches are dataset-dependent and yield inconsistent performance across corpora. In this work, we propose to learn author-specific representations by fine-tuning pre-trained generic language representations with a contrastive objective (Contra-X). We show that Contra-X learns representations that form highly separable clusters for different authors. It advances the state-of-the-art on multiple human and machine authorship attribution benchmarks, enabling improvements of up to 6.8% over cross-entropy fine-tuning. However, we find that Contra-X improves overall accuracy at the cost of sacrificing performance for some authors. Resolving this tension will be an important direction for future work. To the best of our knowledge, we are the first to integrate contrastive learning with pre-trained language model fine-tuning for authorship attribution.
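
The abstract above does not spell out the form of the contrastive objective, so the following is only an illustrative sketch that assumes a generic supervised contrastive loss: embeddings of texts by the same author are pulled together and those by different authors are pushed apart. The function name, the `temperature` parameter, and the toy vectors are assumptions for illustration and are not taken from the paper.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Average supervised contrastive loss over a batch (illustrative only).

    embeddings: list of equal-length float vectors, one per text.
    labels: list of author ids, one per text.
    temperature: assumed scaling hyperparameter, not from the source.
    """
    # L2-normalise so the dot product equals cosine similarity.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives are the other texts in the batch by the same author.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all other texts in the batch.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two tight clusters of 2-d vectors.
vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
loss_when_labels_match_clusters = supervised_contrastive_loss(vectors, [0, 0, 1, 1])
loss_when_labels_cut_clusters = supervised_contrastive_loss(vectors, [0, 1, 0, 1])
```

In this toy batch the loss is lower when the author labels agree with the clusters than when they cut across them, which is the behaviour such an objective rewards during fine-tuning.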

2020

基于强负采样的词嵌入优化算法(Word Embedding Optimization Based on Hard Negative Sampling)
Yuchen Wang (王雨晨) | Miaozhe Lin (林淼哲) | Jiefan Zhan (詹杰凡)
Proceedings of the 19th Chinese National Conference on Computational Linguistics

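word2vec is one of the important word embedding algorithms in natural language processing. To address the vanishing sample contribution problem that can arise when random negative sampling is used as the optimization objective, we propose hard negative sampling methods that use cosine distance as the metric and can be applied to the CBOW and Skip-gram frameworks: HNS-CBOW and HNS-SG. The original random negative sampling process is split into two steps: first, the cosine distance between each randomly sampled negative and the target word is computed; then, the hard negatives that lie closer to the target are used to update the parameters. Using English Wikipedia as the experimental corpus, we quantitatively analyze the effect of the optimized algorithms on public semantic-syntactic datasets. Experiments show that the quality of the optimized word embeddings is significantly better than that of the original method. Moreover, compared with publicly released pre-trained word vectors such as GloVe, the method achieves higher accuracy with a smaller corpus.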
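
The two-step procedure in the abstract above (score random negatives by cosine distance to the target word, then keep the closest ones for the parameter update) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the uniform candidate sampling, and the toy embedding table are all assumptions, and the parameter update itself is omitted.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_hard_negatives(target_vec, embeddings, candidate_ids, num_hard):
    """Keep the num_hard candidates with the smallest cosine distance to the target."""
    scored = [
        (1.0 - cosine_similarity(target_vec, embeddings[i]), i)
        for i in candidate_ids
    ]
    scored.sort()  # ascending distance: hardest negatives first
    return [i for _, i in scored[:num_hard]]


def sample_hard_negatives(target_id, embeddings, num_candidates, num_hard, rng):
    """Step 1: draw random negatives. Step 2: keep only the hardest ones.

    Candidates are drawn uniformly here for simplicity; this is an assumption,
    and the abstract does not state the sampling distribution.
    """
    vocab = [i for i in range(len(embeddings)) if i != target_id]
    candidates = rng.sample(vocab, min(num_candidates, len(vocab)))
    return select_hard_negatives(
        embeddings[target_id], embeddings, candidates, num_hard
    )


# Hypothetical 2-d embedding table; word 0 is the target.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
hard = sample_hard_negatives(0, toy_embeddings, 3, 1, random.Random(0))
```

In the toy table, word 1 points in nearly the same direction as the target, so it is the negative that would be kept for the update.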

2018

Analyzing the Quality of Counseling Conversations: the Tell-Tale Signs of High-quality Counseling
Verónica Pérez-Rosas | Xuetong Sun | Christy Li | Yuchen Wang | Kenneth Resnicow | Rada Mihalcea
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)