Jiacheng Li

2023

Open-world Semi-supervised Generalized Relation Discovery Aligned in a Real-world Setting
William Hogan | Jiacheng Li | Jingbo Shang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Open-world Relation Extraction (OpenRE) has recently garnered significant attention. However, existing approaches tend to oversimplify the problem by assuming that all instances of unlabeled data belong to novel classes, thereby limiting the practicality of these methods. We argue that the OpenRE setting should be more aligned with the characteristics of real-world data. Specifically, we propose two key improvements: (a) unlabeled data should encompass known and novel classes, including negative instances; and (b) the set of novel classes should represent long-tail relation types. Furthermore, we observe that popular relations can often be implicitly inferred through specific patterns, while long-tail relations tend to be explicitly expressed. Motivated by these insights, we present a method called KNoRD (Known and Novel Relation Discovery), which effectively classifies explicitly and implicitly expressed relations from known and novel classes within unlabeled data. Experimental evaluations on several Open-world RE benchmarks demonstrate that KNoRD consistently outperforms other existing methods, achieving significant performance gains.

NetEase.AI at SemEval-2023 Task 2: Enhancing Complex Named Entities Recognition in Noisy Scenarios via Text Error Correction and External Knowledge
Ruixuan Lu | Zihang Tang | Guanglong Hu | Dong Liu | Jiacheng Li
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Complex named entities (NEs), like the titles of creative works, are not simple nouns and pose challenges for NER systems. SemEval-2023 Task 2, MultiCoNER II, aims to recognize complex entities in the presence of out-of-knowledge-base entities and noisy text. To address the challenges posed by MultiCoNER II, our team NetEase.AI proposed an entity recognition system that integrates a text error correction system and external knowledge, and that can recognize entities in text that is noisy or contains entities outside the knowledge base. Upon receiving an input sentence, our system corrects the sentence, extracts the entities in the sentence as a candidate set using an entity recognition model that incorporates gazetteer information, and then uses external knowledge to classify the candidate entities to obtain entity type features. Finally, our system fuses the multi-dimensional features of the candidate entities in a stacking model, which selects the correct entities from the candidate set as the final output. Our system exhibited good noise resistance and excellent entity recognition performance, earning our team first place in the Chinese track of MultiCoNER II.

2022

UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining
Jiacheng Li | Jingbo Shang | Julian McAuley
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

High-quality phrase representations are essential to finding topics and related terms in documents (a.k.a. topic mining). Existing phrase representation learning methods either simply combine unigram representations in a context-free manner or rely on extensive annotations to learn context-aware knowledge. In this paper, we propose UCTopic, a novel unsupervised contrastive learning framework for context-aware phrase representations and topic mining. UCTopic is pretrained at large scale to distinguish whether the contexts of two phrase mentions have the same semantics. The key to the pretraining is positive pair construction from our phrase-oriented assumptions. However, we find that traditional in-batch negatives cause performance decay when finetuning on a dataset with a small number of topics. Hence, we propose cluster-assisted contrastive learning (CCL), which largely reduces noisy negatives by selecting negatives from clusters and further improves phrase representations for topics accordingly. UCTopic outperforms the state-of-the-art phrase representation model by 38.2% NMI on average on four entity clustering tasks. Comprehensive evaluation on topic mining shows that UCTopic can extract coherent and diverse topical phrases.

Fine-grained Contrastive Learning for Relation Extraction
William Hogan | Jiacheng Li | Jingbo Shang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Recent relation extraction (RE) works have shown encouraging improvements by conducting contrastive learning on silver labels generated by distant supervision before fine-tuning on gold labels. Existing methods typically assume all these silver labels are accurate and treat them equally; however, distant supervision is inevitably noisy: some silver labels are more reliable than others. In this paper, we propose fine-grained contrastive learning (FineCL) for RE, which leverages fine-grained information about which silver labels are and are not noisy to improve the quality of learned relationship representations for RE. We first assess the quality of silver labels via a simple and automatic approach we call “learning order denoising,” where we train a language model to learn these relations and record the order of learned training instances. We show that learning order largely corresponds to label accuracy: early-learned silver labels have, on average, more accurate labels than later-learned silver labels. Then, during pre-training, we increase the weights of accurate labels within a novel contrastive learning objective. Experiments on several RE benchmarks show that FineCL makes consistent and significant performance gains over state-of-the-art methods.

MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction
Yue Zhang | Zhenghua Li | Zuyi Bao | Jiacheng Li | Bo Zhang | Chen Li | Fei Huang | Min Zhang
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

This paper presents MuCGEC, a multi-reference multi-source evaluation dataset for Chinese Grammatical Error Correction (CGEC), consisting of 7,063 sentences collected from three Chinese-as-a-Second-Language (CSL) learner sources. Each sentence is corrected by three annotators, and their corrections are carefully reviewed by a senior annotator, resulting in 2.3 references per sentence. We conduct experiments with two mainstream CGEC models, i.e., the sequence-to-sequence model and the sequence-to-edit model, both enhanced with large pretrained language models, achieving competitive benchmark performance on both previous datasets and ours. We also discuss CGEC evaluation methodologies, including the effect of multiple references and of using a char-based metric. Our annotation guidelines, data, and code are available at https://github.com/HillZhang1999/MuCGEC.

2021

Weakly Supervised Named Entity Tagging with Learnable Logical Rules
Jiacheng Li | Haibo Ding | Jingbo Shang | Julian McAuley | Zhe Feng
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We study the problem of building entity tagging systems by using a few rules as weak supervision. Previous methods mostly focus on disambiguating entity types based on contexts and expert-provided rules, while assuming entity spans are given. In this work, we propose a novel method TALLOR that bootstraps high-quality logical rules to train a neural tagger in a fully automated manner. Specifically, we introduce compound rules that are composed from simple rules to increase the precision of boundary detection and generate more diverse pseudo labels. We further design a dynamic label selection strategy to ensure pseudo label quality and therefore avoid overfitting the neural tagger. Experiments on three datasets demonstrate that our method outperforms other weakly supervised methods and even rivals a state-of-the-art distantly supervised tagger with a lexicon of over 2,000 terms when starting from only 20 simple rules. Our method can serve as a tool for rapidly building taggers in emerging domains and tasks. Case studies show that learned rules can potentially explain the predicted entities.

2020

Sensor metadata tagging, akin to the named entity recognition task, provides key contextual information (e.g., measurement type and location) about sensors for running smart building applications. Unfortunately, sensor metadata in different buildings often follows distinct naming conventions. Therefore, learning a tagger currently requires extensive annotations on a per-building basis. In this work, we propose a novel framework, SeNsER, which learns a sensor metadata tagger for a new building based on its raw metadata and an existing fully annotated building. It leverages the commonality between different buildings: at the character level, it employs bidirectional neural language models to capture the shared underlying patterns between two buildings and thus regularizes the feature learning process; at the word level, it leverages as features the k-mers existing in the fully annotated building. During inference, we further incorporate the information obtained from sources such as Wikipedia as prior knowledge. As a result, SeNsER shows promising results in extensive experiments on multiple real-world buildings.

2019

Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects
Jianmo Ni | Jiacheng Li | Julian McAuley
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Several recent works have considered the problem of generating reviews (or ‘tips’) as a form of explanation as to why a recommendation might match a customer’s interests. While promising, we demonstrate that existing approaches struggle (in terms of both quality and content) to generate justifications that are relevant to users’ decision-making process. We seek to introduce new datasets and methods to address the recommendation justification task. In terms of data, we first propose an ‘extractive’ approach to identify review segments which justify users’ intentions; this approach is then used to distantly label massive review corpora and construct large-scale personalized recommendation justification datasets. In terms of generation, we are able to design two personalized generation models with this data: (1) a reference-based Seq2Seq model with aspect-planning which can generate justifications covering different aspects, and (2) an aspect-conditional masked language model which can generate diverse justifications based on templates extracted from justification histories. We conduct experiments on two real-world datasets which show that our model is capable of generating convincing and diverse justifications.