Ping Yang


2022

Zero-Shot Learners for Natural Language Understanding via a Unified Multiple Choice Perspective
Ping Yang | Junjie Wang | Ruyi Gan | Xinyu Zhu | Lin Zhang | Ziwei Wu | Xinyu Gao | Jiaxing Zhang | Tetsuya Sakai
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We propose a new paradigm for zero-shot learners that is format agnostic, i.e., it is compatible with any format and applicable to a wide range of language tasks, such as text classification, commonsense reasoning, coreference resolution, and sentiment analysis. Zero-shot learning aims to train a model on a given task such that it can address new learning tasks without any additional training. Our approach converts zero-shot learning into multiple-choice tasks, avoiding problems in commonly used large-scale generative models such as FLAN. It not only adds generalization ability to models but also significantly reduces the number of parameters, making our method efficient to train and deploy. Our approach shows state-of-the-art performance on several benchmarks and produces satisfactory results on tasks such as natural language inference and text classification. Our model achieves this success with only 235M parameters, which is substantially smaller than state-of-the-art models with billions of parameters. The code and pre-trained models are available at https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/unimc .

2021

面向法律文本的实体关系联合抽取算法(Joint Entity and Relation Extraction for Legal Texts)
Wenhui Song (宋文辉) | Xiang Zhou (周翔) | Ping Yang (杨萍) | Yuanyuan Sun (孙媛媛) | Liang Yang (杨亮) | Hongfei Lin (林鸿飞)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

The rich information contained in legal texts can be represented as structured entity-relation triples, which facilitates the storage and querying of legal knowledge. Traditional pipeline methods perform a large amount of redundant computation when automatically extracting triples, causing error propagation. Existing joint learning methods, meanwhile, cannot handle legal texts with many overlapping relations, and they do not exploit syntactic structure to enhance text representations. We therefore propose a joint entity and relation extraction model for legal texts. The model first injects syntactic information via ON-LSTM, then introduces a multi-head attention mechanism to decompose overlapping relations. Compared with pipeline and other joint learning methods, our model achieves the best extraction performance, reaching an F1 score of 78.7% on a dataset of drug-related legal texts.

融合自编码器和对抗训练的中文新词发现方法(Finding Chinese New Word By Combining Self-encoder and Adversarial Training)
Wei Pan (潘韦) | Tianyuan Liu (刘天元) | Yuqing Sun (孙宇清) | Bin Gong (龚斌) | Yongman Zhang (张永满) | Ping Yang (杨萍)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

The continual emergence of new words is a natural property of language; in specialized domains, for example, new concepts and entity names abstractly summarize sets of shared domain features and often serve as keywords that play particular roles in sentences. New word discovery directly affects the quality of Chinese word segmentation and downstream semantic understanding tasks, making it an important problem in natural language processing. We propose a Chinese new word discovery model that combines an autoencoder with adversarial training. The model is pre-trained with a character-level autoencoder in an unsupervised, self-learning fashion, which extracts semantic information effectively, is unaffected by word segmentation results, and applies to texts from different domains. To incorporate general linguistic knowledge, we add prior syntactic parsing results and fuse semantic and syntactic information through a domain-shared encoder, improving the accuracy of resolving ambiguous words. An adversarial training mechanism extracts domain-independent features and reduces dependence on manually annotated corpora. Experiments on new word discovery over six datasets from different specialized domains show that our model outperforms existing methods, and ablation experiments verify the effectiveness of each module. Comparative experiments with different types of source-domain data and different amounts of target-domain data demonstrate the model's robustness. Finally, we visually compare the encodings produced by the autoencoder and the shared encoder across domains, showing that adversarial training effectively captures both the correlations and the differences between them.