Daqing He


2021

pdf bib
An Empirical Study on Neural Keyphrase Generation
Rui Meng | Xingdi Yuan | Tong Wang | Sanqiang Zhao | Adam Trischler | Daqing He
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Recent years have seen a flourishing of neural keyphrase generation (KPG) works, including the release of several large-scale datasets and a host of new models to tackle them. Model performance on KPG tasks has increased significantly with evolving deep learning research. However, there lacks a comprehensive comparison among different model designs, and a thorough investigation on related factors that may affect a KPG system’s generalization performance. In this empirical study, we aim to fill this gap by providing extensive experimental results and analyzing the most crucial factors impacting the generalizability of KPG models. We hope this study can help clarify some of the uncertainties surrounding the KPG task and facilitate future research on this topic.

pdf bib
Bringing Structure into Summaries: a Faceted Summarization Dataset for Long Scientific Documents
Rui Meng | Khushboo Thaker | Lei Zhang | Yue Dong | Xingdi Yuan | Tong Wang | Daqing He
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Faceted summarization provides briefings of a document from different perspectives. Readers can quickly comprehend the main points of a long document with the help of a structured outline. However, little research has been conducted on this subject, partially due to the lack of large-scale faceted summarization datasets. In this study, we present FacetSum, a faceted summarization benchmark built on Emerald journal articles, covering a diverse range of domains. Different from traditional document-summary pairs, FacetSum provides multiple summaries, each targeted at specific sections of a long document, including the purpose, method, findings, and value. Analyses and empirical results on our dataset reveal the importance of bringing structure into summaries. We believe FacetSum will spur further advances in summarization research and foster the development of NLP systems that can leverage the structured information in both long texts and summaries.

2020

pdf bib
One Size Does Not Fit All: Generating and Evaluating Variable Number of Keyphrases
Xingdi Yuan | Tong Wang | Rui Meng | Khushboo Thaker | Peter Brusilovsky | Daqing He | Adam Trischler
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Different texts shall by nature correspond to different number of keyphrases. This desideratum is largely missing from existing neural keyphrase generation models. In this study, we address this problem from both modeling and evaluation perspectives. We first propose a recurrent generative model that generates multiple keyphrases as delimiter-separated sequences. Generation diversity is further enhanced with two novel techniques by manipulating decoder hidden states. In contrast to previous approaches, our model is capable of generating diverse keyphrases and controlling number of outputs. We further propose two evaluation metrics tailored towards the variable-number generation. We also introduce a new dataset StackEx that expands beyond the only existing genre (i.e., academic writing) in keyphrase generation tasks. With both previous and new evaluation metrics, our model outperforms strong baselines on all datasets.

2018

pdf bib
Integrating Transformer and Paraphrase Rules for Sentence Simplification
Sanqiang Zhao | Rui Meng | Daqing He | Andi Saptono | Bambang Parmanto
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Sentence simplification aims to reduce the complexity of a sentence while retaining its original meaning. Current models for sentence simplification adopted ideas from machine translation studies and implicitly learned simplification mapping rules from normal-simple sentence pairs. In this paper, we explore a novel model based on a multi-layer and multi-head attention architecture and we propose two innovative approaches to integrate the Simple PPDB (A Paraphrase Database for Simplification), an external paraphrase knowledge base for simplification that covers a wide range of real-world simplification rules. The experiments show that the integration provides two major benefits: (1) the integrated model outperforms multiple state-of-the-art baseline models for sentence simplification in the literature (2) through analysis of the rule utilization, the model seeks to select more accurate simplification rules. The code and models used in the paper are available at https://github.com/Sanqiang/text_simplification.

2017

pdf bib
Deep Keyphrase Generation
Rui Meng | Sanqiang Zhao | Shuguang Han | Daqing He | Peter Brusilovsky | Yu Chi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Keyphrase provides highly-summative information that can be effectively used for understanding, organizing and retrieving text content. Though previous studies have provided many workable solutions for automated keyphrase extraction, they commonly divided the to-be-summarized content into multiple text chunks, then ranked and selected the most meaningful ones. These approaches could neither identify keyphrases that do not appear in the text, nor capture the real semantic meaning behind the text. We propose a generative model for keyphrase prediction with an encoder-decoder framework, which can effectively overcome the above drawbacks. We name it as deep keyphrase generation since it attempts to capture the deep semantic meaning of the content with a deep learning method. Empirical analysis on six datasets demonstrates that our proposed model not only achieves a significant performance boost on extracting keyphrases that appear in the source text, but also can generate absent keyphrases based on the semantic meaning of the text. Code and dataset are available at https://github.com/memray/seq2seq-keyphrase.

pdf bib
Enhancing Automatic ICD-9-CM Code Assignment for Medical Texts with PubMed
Danchen Zhang | Daqing He | Sanqiang Zhao | Lei Li
BioNLP 2017

Assigning a standard ICD-9-CM code to disease symptoms in medical texts is an important task in the medical domain. Automating this process could greatly reduce the costs. However, the effectiveness of an automatic ICD-9-CM code classifier faces a serious problem, which can be triggered by unbalanced training data. Frequent diseases often have more training data, which helps its classification to perform better than that of an infrequent disease. However, a disease’s frequency does not necessarily reflect its importance. To resolve this training data shortage problem, we propose to strategically draw data from PubMed to enrich the training data when there is such need. We validate our method on the CMC dataset, and the evaluation results indicate that our method can significantly improve the code assignment classifiers’ performance at the macro-averaging level.

2003

pdf bib
Desparately Seeking Cebuano
Douglas W. Oard | David Doermann | Bonnie Dorr | Daqing He | Philip Resnik | Amy Weinberg | William Byrne | Sanjeev Khudanpur | David Yarowsky | Anton Leuski | Philipp Koehn | Kevin Knight
Companion Volume of the Proceedings of HLT-NAACL 2003 - Short Papers

1997

pdf bib
Referring to Displays in Multimodal Interfaces
Daqing He | Graeme Ritchie | John Lee
Referring Phenomena in a Multimedia Context and their Computational Treatment