Jing Gao


2021

Knowledge-Guided Paraphrase Identification
Haoyu Wang | Fenglong Ma | Yaqing Wang | Jing Gao
Findings of the Association for Computational Linguistics: EMNLP 2021

Paraphrase identification (PI), a fundamental task in natural language processing, is the binary classification problem of deciding whether two sentences express the same or similar meaning. Recently, BERT-like pre-trained language models have been a popular backbone for PI models, but almost all existing methods consider only general-domain text. When these approaches are applied to a specific domain, they cannot make accurate predictions because they lack professional knowledge. In light of this challenge, we propose a novel framework that leverages external unstructured Wikipedia knowledge to accurately identify paraphrases. We propose to mine outline knowledge of concepts related to the given sentences from Wikipedia via the BM25 model. After retrieving the related outline knowledge, the framework makes predictions based on both the semantic information of the two sentences and the outline knowledge. In addition, we propose a gating mechanism to aggregate the semantic information-based prediction and the knowledge-based prediction. Extensive experiments are conducted on two public datasets: PARADE (a computer science domain dataset) and clinicalSTS2019 (a biomedical domain dataset). The results show that the proposed framework outperforms state-of-the-art methods.
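
The BM25 retrieval step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses the standard Okapi BM25 formula with a smoothed IDF, and the whitespace tokenizer, the `k1` and `b` defaults, and the toy outline snippets are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25).

    Assumes `docs` is a non-empty list of non-empty token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Smoothed IDF, which stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy stand-ins for Wikipedia outline knowledge (assumed, not from the paper).
outlines = [
    "binary search tree data structure insertion deletion",
    "hash table collision resolution chaining",
    "gradient descent optimization learning rate",
]
sentence = "how to delete a node from a binary search tree"

scores = bm25_scores(tokenize(sentence), [tokenize(o) for o in outlines])
best = max(range(len(outlines)), key=scores.__getitem__)
print(outlines[best])
```

In a setup like the one the abstract describes, the top-ranked outline text would then be passed to the model alongside the sentence pair; how that is done is not specified here.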

Learning from Language Description: Low-shot Named Entity Recognition via Decomposed Framework
Yaqing Wang | Haoda Chu | Chao Zhang | Jing Gao
Findings of the Association for Computational Linguistics: EMNLP 2021

In this work, we study the problem of named entity recognition (NER) in a low-resource scenario, focusing on few-shot and zero-shot settings. Building upon large-scale pre-trained language models, we propose a novel NER framework, namely SpanNER, which learns from natural language supervision and enables the identification of never-seen entity classes without using in-domain labeled data. We perform extensive experiments on five benchmark datasets and evaluate the proposed method in the few-shot learning, domain transfer and zero-shot learning settings. The experimental results show that the proposed method brings average improvements of 10%, 23% and 26% over the best baselines in the few-shot learning, domain transfer and zero-shot learning settings, respectively.

Profanity-Avoiding Training Framework for Seq2seq Models with Certified Robustness
Hengtong Zhang | Tianhang Zheng | Yaliang Li | Jing Gao | Lu Su | Bo Li
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Seq2seq models have proven highly effective in a wide variety of applications. However, recent research has shown that inappropriate language in training samples and well-designed test cases can induce seq2seq models to output profanity. Such outputs may hurt the usability of seq2seq models and offend end users. To address this problem, we propose a training framework with certified robustness to eliminate the causes that trigger the generation of profanity. The proposed training framework uses only a short list of profanity examples to prevent seq2seq models from generating a broader spectrum of profanity. The framework is composed of a pattern-eliminating training component, which suppresses the impact of language patterns with profanity in the training set, and a trigger-resisting training component, which provides certified robustness for seq2seq models against intentionally injected profanity-triggering expressions in test samples. In the experiments, we consider two representative NLP tasks to which seq2seq models can be applied, i.e., style transfer and dialogue generation. Extensive experimental results show that the proposed training framework can successfully prevent the NLP models from generating profanity.