ChengXiang Zhai

Also published as: Chengxiang Zhai


2022

Domain Representative Keywords Selection: A Probabilistic Approach
Pritom Saha Akash | Jie Huang | Kevin Chang | Yunyao Li | Lucian Popa | ChengXiang Zhai
Findings of the Association for Computational Linguistics: ACL 2022

We propose a probabilistic approach to select a subset of a target domain's representative keywords from a candidate set, contrasting with a context domain. Such a task is crucial for many downstream tasks in natural language processing. To contrast the target domain and the context domain, we adapt the two-component mixture model concept to generate a distribution of candidate keywords. It gives more weight to the distinctive keywords of the target domain than to keywords it shares with the context domain. To support the representativeness of the selected keywords towards the target domain, we introduce an optimization algorithm for selecting the subset from the generated candidate distribution. We show that the optimization algorithm can be efficiently implemented with a near-optimal approximation guarantee. Finally, extensive experiments on multiple domains demonstrate the superiority of our approach over other baselines for the tasks of keyword summary generation and trending keywords selection.

Improving Candidate Retrieval with Entity Profile Generation for Wikidata Entity Linking
Tuan Lai | Heng Ji | ChengXiang Zhai
Findings of the Association for Computational Linguistics: ACL 2022

Entity linking (EL) is the task of linking entity mentions in a document to referent entities in a knowledge base (KB). Many previous studies focus on Wikipedia-derived KBs. There is little work on EL over Wikidata, even though it is the most extensive crowdsourced KB. The scale of Wikidata can open up many new real-world applications, but its massive number of entities also makes EL challenging. To effectively narrow down the search space, we propose a novel candidate retrieval paradigm based on entity profiling. Wikidata entities and their textual fields are first indexed into a text search engine (e.g., Elasticsearch). During inference, given a mention and its context, we use a sequence-to-sequence (seq2seq) model to generate the profile of the target entity, which consists of its title and description. We use the profile to query the indexed search engine to retrieve candidate entities. Our approach complements the traditional approach of using a Wikipedia anchor-text dictionary, enabling us to further design a highly effective hybrid method for candidate retrieval. Combined with a simple cross-attention reranker, our complete EL framework achieves state-of-the-art results on three Wikidata-based datasets and strong performance on TAC-KBP 2010.

Language Model Pre-Training with Sparse Latent Typing
Liliang Ren | Zixuan Zhang | Han Wang | Clare Voss | ChengXiang Zhai | Heng Ji
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Modern large-scale Pre-trained Language Models (PLMs) have achieved tremendous success on a wide range of downstream tasks. However, most LM pre-training objectives focus only on text reconstruction and do not seek to learn latent-level interpretable representations of sentences. In this paper, we push language models toward a deeper understanding of sentences by proposing a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types. Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge. In addition, a language model pre-trained with this objective significantly improves downstream tasks related to Information Extraction in both supervised and few-shot settings. Our code is publicly available at https://github.com/renll/SparseLT.

Generation of Student Questions for Inquiry-based Learning
Kevin Ros | Maxwell Jong | Chak Ho Chan | ChengXiang Zhai
Proceedings of the 15th International Conference on Natural Language Generation

Analogy Generation by Prompting Large Language Models: A Case Study of InstructGPT
Bhavya Bhavya | Jinjun Xiong | ChengXiang Zhai
Proceedings of the 15th International Conference on Natural Language Generation

CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval
Kung-Hsiang Huang | ChengXiang Zhai | Heng Ji
Proceedings of the 29th International Conference on Computational Linguistics

Fact-checking has gained increasing attention due to the wide spread of falsified information. Most fact-checking approaches focus only on claims made in English because of data scarcity in other languages. The lack of fact-checking datasets in low-resource languages calls for an effective cross-lingual transfer technique for fact-checking. Additionally, trustworthy information in different languages can be complementary and helpful in verifying facts. To this end, we present the first fact-checking framework augmented with cross-lingual retrieval that aggregates evidence retrieved from multiple languages through a cross-lingual retriever. Given the absence of cross-lingual information retrieval datasets with claim-like queries, we train the retriever with our proposed Cross-lingual Inverse Cloze Task (X-ICT), a self-supervised algorithm that creates training instances by translating the title of a passage. The goal of X-ICT is to learn cross-lingual retrieval, in which the model learns to identify the passage corresponding to a given translated title. On the X-Fact dataset, our approach achieves a 2.23% absolute F1 improvement in the zero-shot cross-lingual setup over prior systems. The source code and data are publicly available at https://github.com/khuangaf/CONCRETE.

2021

Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries
Carl Edwards | ChengXiang Zhai | Heng Ji
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We propose a new task, Text2Mol, to retrieve molecules using natural language descriptions as queries. Natural language and molecules encode information in very different ways, which leads to the exciting but challenging problem of integrating these two very different modalities. Although some work has been done on text-based retrieval and structure-based retrieval, this new task requires integrating molecules and natural language more directly. Moreover, this can be viewed as an especially challenging cross-lingual retrieval problem by considering the molecules as a language with a unique grammar. We construct a paired dataset of molecules and their corresponding text descriptions, which we use to learn an aligned common semantic embedding space for retrieval. We extend this to create a cross-modal attention-based model for explainability and reranking by interpreting the attentions as association rules. We also employ an ensemble approach to integrate our different architectures, which significantly improves results from 0.372 to 0.499 MRR. This new multimodal approach opens a new perspective on solving problems in chemistry literature understanding and molecular machine learning.

BERT might be Overkill: A Tiny but Effective Biomedical Entity Linker based on Residual Convolutional Neural Networks
Tuan Lai | Heng Ji | ChengXiang Zhai
Findings of the Association for Computational Linguistics: EMNLP 2021

Biomedical entity linking is the task of linking entity mentions in a biomedical document to referent entities in a knowledge base. Recently, many BERT-based models have been introduced for the task. While these models achieve competitive results on many datasets, they are computationally expensive and contain about 110M parameters. Little is known about the factors contributing to their impressive performance and whether the over-parameterization is needed. In this work, we shed some light on the inner workings of these large BERT-based models. Through a set of probing experiments, we find that the entity linking performance changes only slightly when the input word order is shuffled or when the attention scope is limited to a fixed window size. From these observations, we propose an efficient convolutional neural network with residual connections for biomedical entity linking. Because of its sparse connectivity and weight-sharing properties, our model has a small number of parameters and is highly efficient. On five public datasets, our model achieves comparable or even better linking accuracy than the state-of-the-art BERT-based models while having about 60 times fewer parameters.

Joint Biomedical Entity and Relation Extraction with Knowledge-Enhanced Collective Inference
Tuan Lai | Heng Ji | ChengXiang Zhai | Quan Hung Tran
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Compared to the general news domain, information extraction (IE) from biomedical text requires much broader domain knowledge. However, many previous IE methods do not utilize any external knowledge during inference. Due to the exponential growth of biomedical publications, models that do not go beyond their fixed set of parameters will likely fall behind. Inspired by how humans look up relevant information to comprehend a scientific text, we present a novel framework that utilizes external knowledge for joint entity and relation extraction named KECI (Knowledge-Enhanced Collective Inference). Given an input text, KECI first constructs an initial span graph representing its initial understanding of the text. It then uses an entity linker to form a knowledge graph containing relevant background knowledge for the entity mentions in the text. To make the final predictions, KECI fuses the initial span graph and the knowledge graph into a more refined graph using an attention mechanism. KECI takes a collective approach to link mention spans to entities by integrating global relational information into local representations using graph convolutional networks. Our experimental results show that the framework is highly effective, achieving new state-of-the-art results on two different benchmark datasets: BioRelEx (binding interaction detection) and ADE (adverse drug event extraction). For example, KECI achieves absolute improvements of 4.59% and 4.91% in F1 scores over the state-of-the-art on the BioRelEx entity and relation extraction tasks.

2020

Multi-task Learning for Multilingual Neural Machine Translation
Yiren Wang | ChengXiang Zhai | Hany Hassan
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

While monolingual data has been shown to be useful in improving bilingual neural machine translation (NMT), effectively and efficiently leveraging monolingual data for Multilingual NMT (MNMT) systems is a less explored area. In this work, we propose a multi-task learning (MTL) framework that jointly trains the model with the translation task on bitext data and two denoising tasks on the monolingual data. We conduct extensive empirical studies on MNMT systems with 10 language pairs from WMT datasets. We show that the proposed approach can effectively improve the translation quality for both high-resource and low-resource languages by a large margin, achieving significantly better results than the individual bilingual models. We also demonstrate the efficacy of the proposed approach in the zero-shot setup for language pairs without bitext training data. Furthermore, we show the effectiveness of MTL over pre-training approaches for both NMT and cross-lingual transfer learning NLU tasks; the proposed approach outperforms massive-scale models trained on a single task.

2019

TILM: Neural Language Models with Evolving Topical Influence
Shubhra Kanti Karmaker Santu | Kalyan Veeramachaneni | Chengxiang Zhai
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

The content of text data is often influenced by contextual factors that evolve over time (e.g., the content of social media is often influenced by topics covered in the major news streams). Existing language models do not consider the influence of such related evolving topics, and thus are not optimal. In this paper, we propose to incorporate such topical influence into a language model to both improve its accuracy and enable cross-stream analysis of topical influences. Specifically, we propose the Topical Influence Language Model (TILM), a novel extension of a neural language model that captures the influences on the contents in one text stream by the evolving topics in another related (or possibly the same) text stream. Experimental results on six different text stream datasets composed of conference paper titles show that the incorporation of evolving topical influence into a language model is beneficial and that TILM outperforms multiple baselines in a challenging task of text forecasting. In addition to serving as a language model, TILM further enables interesting analysis of topical influence among multiple text streams.

2017

Identifying Humor in Reviews using Background Text Sources
Alex Morales | Chengxiang Zhai
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We study the problem of automatically identifying humorous text from a new kind of text data, i.e., online reviews. We propose a generative language model, based on the theory of incongruity, to model humorous text, which allows us to leverage background text sources, such as Wikipedia entry descriptions, and enables construction of multiple features for identifying humorous reviews. Evaluation of these features using supervised learning for classifying reviews into humorous and non-humorous reviews shows that the features constructed based on the proposed generative model are much more effective than the major features proposed in the existing literature, allowing us to achieve almost 86% accuracy. These humorous review predictions can also supply good indicators for identifying helpful reviews.

2016

MeTA: A Unified Toolkit for Text Retrieval and Analysis
Sean Massung | Chase Geigle | ChengXiang Zhai
Proceedings of ACL-2016 System Demonstrations

2012

A Discriminative Model for Query Spelling Correction with Latent Structural SVM
Huizhong Duan | Yanen Li | ChengXiang Zhai | Dan Roth
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

Structural Topic Model for Latent Topical Structure Analysis
Hongning Wang | Duo Zhang | ChengXiang Zhai
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Cross-Lingual Latent Topic Extraction
Duo Zhang | Qiaozhu Mei | ChengXiang Zhai
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Summarizing Contrastive Viewpoints in Opinionated Text
Michael Paul | ChengXiang Zhai | Roxana Girju
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Opinosis: A Graph Based Approach to Abstractive Summarization of Highly Redundant Opinions
Kavita Ganesan | ChengXiang Zhai | Jiawei Han
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Exploiting Structured Ontology to Organize Scattered Online Opinions
Yue Lu | Huizhong Duan | Hongning Wang | ChengXiang Zhai
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Shallow Information Extraction from Medical Forum Data
Parikshit Sondhi | Manish Gupta | ChengXiang Zhai | Julia Hockenmaier
Coling 2010: Posters

2008

Generating Impact-Based Summaries for Scientific Literature
Qiaozhu Mei | ChengXiang Zhai
Proceedings of ACL-08: HLT

2007

Instance Weighting for Domain Adaptation in NLP
Jing Jiang | ChengXiang Zhai
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference
Candace Sidner | Tanja Schultz | Matthew Stone | ChengXiang Zhai
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

A Systematic Exploration of the Feature Space for Relation Extraction
Jing Jiang | ChengXiang Zhai
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers
Candace Sidner | Tanja Schultz | Matthew Stone | ChengXiang Zhai
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

Statistical Language Models for Information Retrieval
ChengXiang Zhai
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Tutorial Abstracts

2006

Named Entity Transliteration with Comparable Corpora
Richard Sproat | Tao Tao | ChengXiang Zhai
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Unsupervised Named Entity Transliteration Using Temporal and Phonetic Correlation
Tao Tao | Su-Youn Yoon | Andrew Fister | Richard Sproat | ChengXiang Zhai
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

Exploiting Domain Structure for Named Entity Recognition
Jing Jiang | ChengXiang Zhai
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

Language Model Information Retrieval with Document Expansion
Tao Tao | Xuanhui Wang | Qiaozhu Mei | ChengXiang Zhai
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

1997

Fast Statistical Parsing of Noun Phrases for Document Indexing
Chengxiang Zhai
Fifth Conference on Applied Natural Language Processing

1996

Noun Phrase Analysis in Large Unrestricted Text for Information Retrieval
David A. Evans | Chengxiang Zhai
34th Annual Meeting of the Association for Computational Linguistics