Xuan Wang


Seed-Guided Topic Discovery with Out-of-Vocabulary Seeds
Yu Zhang | Yu Meng | Xuan Wang | Sheng Wang | Jiawei Han
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Discovering latent topics from text corpora has been studied for decades. Many existing topic models adopt a fully unsupervised setting, and their discovered topics may not cater to users’ particular interests due to their inability of leveraging user guidance. Although there exist seed-guided topic discovery approaches that leverage user-provided seeds to discover topic-representative terms, they are less concerned with two factors: (1) the existence of out-of-vocabulary seeds and (2) the power of pre-trained language models (PLMs). In this paper, we generalize the task of seed-guided topic discovery to allow out-of-vocabulary seeds. We propose a novel framework, named SeeTopic, wherein the general knowledge of PLMs and the local semantics learned from the input corpus can mutually benefit each other. Experiments on three real datasets from different domains demonstrate the effectiveness of SeeTopic in terms of topic coherence, accuracy, and diversity.


ChemNER: Fine-Grained Chemistry Named Entity Recognition with Ontology-Guided Distant Supervision
Xuan Wang | Vivian Hu | Xiangchen Song | Shweta Garg | Jinfeng Xiao | Jiawei Han
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Scientific literature analysis needs fine-grained named entity recognition (NER) to provide a wide range of information for scientific discovery. For example, chemistry research needs to study dozens to hundreds of distinct, fine-grained entity types, making consistent and accurate annotation difficult even for crowds of domain experts. On the other hand, domain-specific ontologies and knowledge bases (KBs) can be easily accessed, constructed, or integrated, which makes distant supervision realistic for fine-grained chemistry NER. In distant supervision, training labels are generated by matching mentions in a document with the concepts in the knowledge bases (KBs). However, this kind of KB-matching suffers from two major challenges: incomplete annotation and noisy annotation. We propose ChemNER, an ontology-guided, distantly-supervised method for fine-grained chemistry NER to tackle these challenges. It leverages the chemistry type ontology structure to generate distant labels with novel methods of flexible KB-matching and ontology-guided multi-type disambiguation. It significantly improves the distant label generation for the subsequent sequence labeling model training. We also provide an expert-labeled, chemistry NER dataset with 62 fine-grained chemistry types (e.g., chemical compounds and chemical reactions). Experimental results show that ChemNER is highly effective, outperforming substantially the state-of-the-art NER methods (with .25 absolute F1 score improvement).

Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training
Yu Meng | Yunyi Zhang | Jiaxin Huang | Xuan Wang | Yu Zhang | Heng Ji | Jiawei Han
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We study the problem of training named entity recognition (NER) models using only distantly-labeled data, which can be automatically obtained by matching entity mentions in the raw text with entity types in a knowledge base. The biggest challenge of distantly-supervised NER is that the distant supervision may induce incomplete and noisy labels, rendering the straightforward application of supervised learning ineffective. In this paper, we propose (1) a noise-robust learning scheme comprised of a new loss function and a noisy label removal step, for training NER models on distantly-labeled data, and (2) a self-training method that uses contextualized augmentations created by pre-trained language models to improve the generalization ability of the NER model. On three benchmark datasets, our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.

COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation
Qingyun Wang | Manling Li | Xuan Wang | Nikolaus Parulian | Guangxing Han | Jiawei Ma | Jingxuan Tu | Ying Lin | Ranran Haoran Zhang | Weili Liu | Aabhas Chauhan | Yingjun Guan | Bangzheng Li | Ruisong Li | Xiangchen Song | Yi Fung | Heng Ji | Jiawei Han | Shih-Fu Chang | James Pustejovsky | Jasmine Rah | David Liem | Ahmed ELsayed | Martha Palmer | Clare Voss | Cynthia Schneider | Boyan Onyshkevych
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

To combat COVID-19, both clinicians and scientists need to digest the vast amount of relevant biomedical knowledge in literature to understand the disease mechanism and the related biological functions. We have developed a novel and comprehensive knowledge discovery framework, COVID-KG to extract fine-grained multimedia knowledge elements (entities, relations and events) from scientific literature. We then exploit the constructed multimedia knowledge graphs (KGs) for question answering and report generation, using drug repurposing as a case study. Our framework also provides detailed contextual sentences, subfigures, and knowledge subgraphs as evidence. All of the data, KGs, reports.

Noise Robust Named Entity Understanding for Voice Assistants
Deepak Muralidharan | Joel Ruben Antony Moniz | Sida Gao | Xiao Yang | Justine Kao | Stephen Pulman | Atish Kothari | Ray Shen | Yinying Pan | Vivek Kaul | Mubarak Seyed Ibrahim | Gang Xiang | Nan Dun | Yidan Zhou | Andy O | Yuan Zhang | Pooja Chitkara | Xuan Wang | Alkesh Patel | Kushal Tayal | Roger Zheng | Peter Grasch | Jason D Williams | Lin Li
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

Named Entity Recognition (NER) and Entity Linking (EL) play an essential role in voice assistant interaction, but are challenging due to the special difficulties associated with spoken user queries. In this paper, we propose a novel architecture that jointly solves the NER and EL tasks by combining them in a joint reranking module. We show that our proposed framework improves NER accuracy by up to 3.13% and EL accuracy by up to 3.6% in F1 score. The features used also lead to better accuracies in other natural language understanding tasks, such as domain classification and semantic parsing.


EVIDENCEMINER: Textual Evidence Discovery for Life Sciences
Xuan Wang | Yingjun Guan | Weili Liu | Aabhas Chauhan | Enyi Jiang | Qi Li | David Liem | Dibakar Sigdel | John Caufield | Peipei Ping | Jiawei Han
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Traditional search engines for life sciences (e.g., PubMed) are designed for document retrieval and do not allow direct retrieval of specific statements. Some of these statements may serve as textual evidence that is key to tasks such as hypothesis generation and new finding validation. We present EVIDENCEMINER, a web-based system that lets users query a natural language statement and automatically retrieves textual evidence from a background corpora for life sciences. EVIDENCEMINER is constructed in a completely automated way without any human effort for training data annotation. It is supported by novel data-driven methods for distantly supervised named entity recognition and open information extraction. The entities and patterns are pre-computed and indexed offline to support fast online evidence retrieval. The annotation results are also highlighted in the original document for better visualization. EVIDENCEMINER also includes analytic functionalities such as the most frequent entity and relation summarization. EVIDENCEMINER can help scientists uncover important research issues, leading to more effective research and more in-depth quantitative analysis. The system of EVIDENCEMINER is available at https://evidenceminer.firebaseapp.com/.


Variational Autoregressive Decoder for Neural Response Generation
Jiachen Du | Wenjie Li | Yulan He | Ruifeng Xu | Lidong Bing | Xuan Wang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Combining the virtues of probability graphic models and neural networks, Conditional Variational Auto-encoder (CVAE) has shown promising performance in applications such as response generation. However, existing CVAE-based models often generate responses from a single latent variable which may not be sufficient to model high variability in responses. To solve this problem, we propose a novel model that sequentially introduces a series of latent variables to condition the generation of each word in the response sequence. In addition, the approximate posteriors of these latent variables are augmented with a backward Recurrent Neural Network (RNN), which allows the latent variables to capture long-term dependencies of future tokens in generation. To facilitate training, we supplement our model with an auxiliary objective that predicts the subsequent bag of words. Empirical experiments conducted on Opensubtitle and Reddit datasets show that the proposed model leads to significant improvement on both relevance and diversity over state-of-the-art baselines.


Life-iNet: A Structured Network-Based Knowledge Exploration and Analytics System for Life Sciences
Xiang Ren | Jiaming Shen | Meng Qu | Xuan Wang | Zeqiu Wu | Qi Zhu | Meng Jiang | Fangbo Tao | Saurabh Sinha | David Liem | Peipei Ping | Richard Weinshilboum | Jiawei Han
Proceedings of ACL 2017, System Demonstrations

XJNLP at SemEval-2017 Task 12: Clinical temporal information ex-traction with a Hybrid Model
Yu Long | Zhijing Li | Xuan Wang | Chen Li
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

Temporality is crucial in understanding the course of clinical events from a patient’s electronic health recordsand temporal processing is becoming more and more important for improving access to content.SemEval 2017 Task 12 (Clinical TempEval) addressed this challenge using the THYME corpus, a corpus of clinical narratives annotated with a schema based on TimeML2 guidelines. We developed and evaluated approaches for: extraction of temporal expressions (TIMEX3) and EVENTs; EVENT attributes; document-time relations. Our approach is a hybrid model which is based on rule based methods, semi-supervised learning, and semantic features with addition of manually crafted rules.


Improving Distributed Representation of Word Sense via WordNet Gloss Composition and Context Clustering
Tao Chen | Ruifeng Xu | Yulan He | Xuan Wang
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)


Simple Maximum Entropy Models for Multilingual Coreference Resolution
Xinxin Li | Xuan Wang | Xingwei Liao
Joint Conference on EMNLP and CoNLL - Shared Task

A Light Weight Stemmer for Urdu Language: A Scarce Resourced Language
Sajjad Ahmad Khan | Waqas Anwar | Usama Ijaz Bajwa | Xuan Wang
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

N-gram and Gazetteer List Based Named Entity Recognition for Urdu: A Scarce Resourced Language
Faryal Jahangir | Waqas Anwar | Usama Ijaz Bajwa | Xuan Wang
Proceedings of the 10th Workshop on Asian Language Resources

Zhijun Wu: Chinese Semantic Dependency Parsing with Third-Order Features
Zhijun Wu | Xuan Wang | Xinxin Li
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)


Diversifying Information Needs in Results of Question Retrieval
Yaoyun Zhang | Xiaolong Wang | Xuan Wang | Ruifeng Xu | Jun Xu | ShiXi Fan
Proceedings of 5th International Joint Conference on Natural Language Processing

Coreference Resolution with Loose Transitivity Constraints
Xinxin Li | Xuan Wang | Shuhan Qi
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task


pdf bib
A Cascade Method for Detecting Hedges and their Scope in Natural Language Text
Buzhou Tang | Xiaolong Wang | Xuan Wang | Bo Yuan | Shixi Fan
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

Exploiting Rich Features for Detecting Hedges and their Scope
Xinxin Li | Jianping Shen | Xiang Gao | Xuan Wang
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

Chinese Word Segmentation based on Mixing Multiple Preprocessor and CRF
Jianping Shen | Xuan Wang | Hainan Zhao | Wenxiao Zhang
CIPS-SIGHAN Joint Conference on Chinese Language Processing


A Joint Syntactic and Semantic Dependency Parsing System based on Maximum Entropy Models
Buzhou Tang | Lu Li | Xinxin Li | Xuan Wang | Xiaolong Wang
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task


pdf bib
Semantic Chunk Annotation for complex questions using Conditional Random Field
Shixi Fan | Yaoyun Zhang | Wing W. Y. Ng | Xuan Wang | Xiaolong Wang
Coling 2008: Proceedings of the workshop on Knowledge and Reasoning for Answering Questions

Discriminative Learning of Syntactic and Semantic Dependencies
Lu Li | Shixi Fan | Xuan Wang | Xiaolong Wang
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

Chunking with Max-Margin Markov Networks
Buzhou Tang | Xuan Wang | Xiaolong Wang
Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation