Kuansan Wang


2021

pdf bib
Proceedings of the Second Workshop on Scholarly Document Processing
Iz Beltagy | Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Keith Hall | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Robert M. Patton | Michal Shmueli-Scheuer | Anita de Waard | Kuansan Wang | Lucy Lu Wang
Proceedings of the Second Workshop on Scholarly Document Processing

pdf
Overview of the Second Workshop on Scholarly Document Processing
Iz Beltagy | Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Keith Hall | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Robert Patton | Michal Shmueli-Scheuer | Anita de Waard | Kuansan Wang | Lucy Lu Wang
Proceedings of the Second Workshop on Scholarly Document Processing

With the ever-increasing pace of research and high volume of scholarly communication, scholars face a daunting task. Not only must they keep up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. These needs have motivated an increasing focus on computational methods for enhancing search, summarization, and analysis of scholarly documents. However, the various strands of research on scholarly document processing remain fragmented. To reach out to the broader NLP and AI/ML community, pool distributed efforts in this area, and enable shared access to published research, we held the 2nd Workshop on Scholarly Document Processing (SDP) at NAACL 2021 as a virtual event (https://sdproc.org/2021/). The SDP workshop consisted of a research track, three invited talks, and three Shared Tasks (LongSumm 2021, SCIVER, and 3C). The program was geared towards the application of NLP, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges.

pdf
SciConceptMiner: A system for large-scale scientific concept discovery
Zhihong Shen | Chieh-Han Wu | Li Ma | Chien-Pang Chen | Kuansan Wang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations

Scientific knowledge is evolving at an unprecedented rate of speed, with new concepts constantly being introduced from millions of academic articles published every month. In this paper, we introduce a self-supervised end-to-end system, SciConceptMiner, for the automatic capture of emerging scientific concepts from both independent knowledge sources (semi-structured data) and academic publications (unstructured documents). First, we adopt a BERT-based sequence labeling model to predict candidate concept phrases with self-supervision data. Then, we incorporate rich Web content for synonym detection and concept selection via a web search API. This two-stage approach achieves highly accurate (94.7%) concept identification with more than 740K scientific concepts. These concepts are deployed in the Microsoft Academic production system and are the backbone for its semantic search capability.

2020

pdf bib
CORD-19: The COVID-19 Open Research Dataset
Lucy Lu Wang | Kyle Lo | Yoganand Chandrasekhar | Russell Reas | Jiangjiang Yang | Doug Burdick | Darrin Eide | Kathryn Funk | Yannis Katsis | Rodney Michael Kinney | Yunyao Li | Ziyang Liu | William Merrill | Paul Mooney | Dewey A. Murdick | Devvret Rishi | Jerry Sheehan | Zhihong Shen | Brandon Stilson | Alex D. Wade | Kuansan Wang | Nancy Xin Ru Wang | Christopher Wilhelm | Boya Xie | Douglas M. Raymond | Daniel S. Weld | Oren Etzioni | Sebastian Kohlmeier
Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020

The COVID-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers on COVID-19 and related historical coronavirus research. CORD-19 is designed to facilitate the development of text mining and information retrieval systems over its rich collection of metadata and structured full text papers. Since its release, CORD-19 has been downloaded over 200K times and has served as the basis of many COVID-19 text mining and discovery systems. In this article, we describe the mechanics of dataset construction, highlighting challenges and key design decisions, provide an overview of how CORD-19 has been used, and describe several shared tasks built around the dataset. We hope this resource will continue to bring together the computing community, biomedical experts, and policy makers in the search for effective treatments and management policies for COVID-19.

pdf
Explainable and Sparse Representations of Academic Articles for Knowledge Exploration
Keng-Te Liao | Zhihong Shen | Chiyuan Huang | Chieh-Han Wu | PoChun Chen | Kuansan Wang | Shou-de Lin
Proceedings of the 28th International Conference on Computational Linguistics

We focus on a recently deployed system built for summarizing academic articles by concept tagging. The system has shown great coverage and high accuracy of concept identification which could be contributed by the knowledge acquired from millions of publications. Provided with the interpretable concepts and knowledge encoded in a pre-trained neural model, we investigate whether the tagged concepts can be applied to a broader class of applications. We propose transforming the tagged concepts into sparse vectors as representations of academic documents. The effectiveness of the representations is analyzed theoretically by a proposed framework. We also empirically show that the representations can have advantages on academic topic discovery and paper recommendation. On these applications, we reveal that the knowledge encoded in the tagging system can be effectively utilized and can help infer additional features from data with limited information.

2018

pdf
A Web-scale system for scientific knowledge exploration
Zhihong Shen | Hao Ma | Kuansan Wang
Proceedings of ACL 2018, System Demonstrations

To enable efficient exploration of Web-scale scientific knowledge, it is necessary to organize scientific publications into a hierarchical concept structure. In this work, we present a large-scale system to (1) identify hundreds of thousands of scientific concepts, (2) tag these identified concepts to hundreds of millions of scientific publications by leveraging both text and graph structure, and (3) build a six-level concept hierarchy with a subsumption-based model. The system builds the most comprehensive cross-domain scientific concept ontology published to date, with more than 200 thousand concepts and over one million relationships.

2010

pdf
An Overview of Microsoft Web N-gram Corpus and Applications
Kuansan Wang | Chris Thrasher | Evelyne Viegas | Xiaolong Li | Bo-june Paul Hsu
Proceedings of the NAACL HLT 2010 Demonstration Session

2004

pdf
Use and Acquisition of Semantic Language Model
Kuansan Wang | Ye-Yi Wang | Alex Acero
Demonstration Papers at HLT-NAACL 2004

2002

pdf
SALT: An XML Application for Web-based Multimodal Dialog Management
Kuansan Wang
COLING-02: The 2nd Workshop on NLP and XML (NLPXML-2002)