Nianwen Xue

2024

We explore using LLMs, GPT-4 specifically, to generate draft sentence-level Chinese Uniform Meaning Representations (UMRs) that human annotators can revise to speed up the UMR annotation process. In this study, we use few-shot learning and Think-Aloud prompting to guide GPT-4 to generate sentence-level graphs of UMR. Our experimental results show that compared with annotating UMRs from scratch, using LLMs as a preprocessing step reduces the annotation time by two thirds on average. This indicates that there is great potential for integrating LLMs into the pipeline for complicated semantic annotation tasks.

A Pipeline Approach for Parsing Documents into Uniform Meaning Representation Graphs
Jayeol Chun | Nianwen Xue
Proceedings of TextGraphs-17: Graph-based Methods for Natural Language Processing

Uniform Meaning Representation (UMR) is the next phase of semantic formalism following Abstract Meaning Representation (AMR), with added focus on inter-sentential relations allowing the representational scope of UMR to cover a full document. This, in turn, greatly increases the complexity of its parsing task with the additional requirement of capturing document-level linguistic phenomena such as coreference, modal and temporal dependencies. In order to establish a strong baseline despite the small size of the recently released UMR v1.0 corpus, we introduce a pipeline model that does not require any training. At the core of our method is a two-track strategy of obtaining UMR’s sentence and document graphs separately, with the document-level triples being compiled at the token level and the sentence graph being converted from AMR graphs. By leveraging alignment between AMR and its sentence, we are able to generate the first automatic English UMR parses.

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Nicoletta Calzolari | Min-Yen Kan | Veronique Hoste | Alessandro Lenci | Sakriani Sakti | Nianwen Xue

Anchor and Broadcast: An Efficient Concept Alignment Approach for Evaluation of Semantic Graphs
Haibo Sun | Nianwen Xue
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper, we present AnCast, an intuitive and efficient tool for evaluating graph-based meaning representations (MR). AnCast implements evaluation metrics that are well understood in the NLP community, including concept F1, unlabeled relation F1, labeled relation F1, and weighted relation F1. The efficiency of the tool comes from a novel anchor broadcast alignment algorithm that does not get trapped in local maxima. We show through experimental results that the AnCast score is highly correlated with the widely used Smatch score, but its computation takes only about 40% of the time.

This paper reports the first release of the UMR (Uniform Meaning Representation) data set. UMR is a graph-based meaning representation formalism consisting of a sentence-level graph and a document-level graph. The sentence-level graph represents predicate-argument structures, named entities, word senses, aspectuality of events, as well as person and number information for entities. The document-level graph represents coreferential, temporal, and modal relations that go beyond sentence boundaries. UMR is designed to capture the commonalities and variations across languages, and this is done through the use of a common set of abstract concepts, relations, and attributes as well as concrete concepts derived from words from individual languages. This UMR release includes annotations for six languages (Arapaho, Chinese, English, Kukama, Navajo, Sanapana) that vary greatly in terms of their linguistic properties and resource availability. We also describe on-going efforts to enlarge this data set and extend it to other genres and modalities, and briefly describe the available infrastructure (UMR annotation guidelines and tools) that others can use to create similar data sets.

Meaning Representations for Natural Languages: Design, Models and Applications
Julia Bonn | Jeffrey Flanigan | Jan Hajič | Ishan Jindal | Yunyao Li | Nianwen Xue
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries

This tutorial reviews the design of common meaning representations, SoTA models for predicting meaning representations, and the applications of meaning representations in a wide range of downstream NLP tasks and real-world applications. Reporting by a diverse team of NLP researchers from academia and industry with extensive experience in designing, building and using meaning representations, our tutorial has three components: (1) an introduction to common meaning representations, including basic concepts and design challenges; (2) a review of SoTA methods on building models for meaning representations; and (3) an overview of applications of meaning representations in downstream NLP tasks and real-world applications. We propose a cutting-edge, full-day tutorial for all stakeholders in the AI community, including NLP researchers, domain-specific practitioners, and students.

2023

Cross-Document Event Coreference Resolution: Instruct Humans or Instruct GPT?
Jin Zhao | Nianwen Xue | Bonan Min
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

This paper explores utilizing Large Language Models (LLMs) to perform Cross-Document Event Coreference Resolution (CDEC) annotations and evaluates how they fare against human annotators with different levels of training. Specifically, we formulate CDEC as a multi-category classification problem on pairs of events that are represented as decontextualized sentences, and compare the predictions of GPT-4 with the judgment of fully trained annotators and crowdworkers on the same data set. Our study indicates that GPT-4 with zero-shot learning outperforms crowdworkers by a large margin and exhibits a level of performance comparable to trained annotators. Upon closer analysis, GPT-4 also exhibits a tendency to be overly confident and to force annotation decisions even when such decisions are not warranted due to insufficient information. Our results have implications for how to perform complicated annotations such as CDEC in the age of LLMs. They show that the best way to acquire such annotations might be to combine the strengths of LLMs and trained human annotators in the annotation process, and that using untrained or undertrained crowdworkers is no longer a viable option to acquire high-quality data to advance the state of the art for such problems.

UMR annotation of Chinese Verb compounds and related constructions
Haibo Sun | Yifan Zhu | Jin Zhao | Nianwen Xue
Proceedings of the First International Workshop on Construction Grammars and NLP (CxGs+NLP, GURT/SyntaxFest 2023)

This paper discusses the challenges of annotating the predicate-argument structure of Chinese verb compounds in Uniform Meaning Representation (UMR), a recent meaning representation framework that extends Abstract Meaning Representation (AMR) to cross-linguistic settings. The key issue is to decide whether to annotate the argument structure of a verb compound as a whole, or to annotate the argument structure of its component verbs as well as the relations between them. We examine different types of Chinese verb compounds, and propose how to annotate them based on the principle of compositionality, level of grammaticalization, and productivity of component verbs. We propose a solution to the practical problem of having to define the semantic roles for Chinese verb compounds that are quite open-ended by separating compositional verb compounds from verb compounds that are non-compositional or have grammaticalized verb components. For compositional verb compounds, instead of annotating the argument structure of the verb compound as a whole, we annotate the argument structure of the component verbs as well as the semantic relations between them, as creating an exhaustive list of such verb compounds is infeasible. Verb compounds with grammaticalized verb components also tend to be productive, and we represent grammaticalized verb components as either attributes of the primary verb or as relations.

A Kind Introduction to Lexical and Grammatical Aspect, with a Survey of Computational Approaches
Annemarie Friedrich | Nianwen Xue | Alexis Palmer
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Aspectual meaning refers to how the internal temporal structure of situations is presented. This includes whether a situation is described as a state or as an event, whether the situation is finished or ongoing, and whether it is viewed as a whole or with a focus on a particular phase. This survey gives an overview of computational approaches to modeling lexical and grammatical aspect along with intuitive explanations of the necessary linguistic concepts and terminology. In particular, we describe the concepts of stativity, telicity, habituality, perfective and imperfective, as well as influential inventories of eventuality and situation types. Aspect is a crucial component of semantics, especially for precise reporting of the temporal structure of situations, and future NLP approaches need to be able to handle and evaluate it systematically.

This paper presents detailed mappings between the structures used in Abstract Meaning Representation (AMR) and those used in Uniform Meaning Representation (UMR). These structures include general semantic roles, rolesets, and concepts that are largely shared between AMR and UMR, but with crucial differences. While UMR annotation of new low-resource languages is ongoing, AMR-annotated corpora already exist for many languages, and these AMR corpora are ripe for conversion to UMR format. Rather than focusing on semantic coverage that is new to UMR (which will likely need to be dealt with manually), this paper serves as a resource (with illustrated mappings) for users looking to understand the fine-grained adjustments that have been made to the representation techniques for semantic categories present in both AMR and UMR.

UMR-Writer is a web-based tool for annotating semantic graphs with the Uniform Meaning Representation (UMR) scheme. UMR is a graph-based semantic representation that can be applied cross-linguistically for deep semantic analysis of texts. In this work, we implemented a new keyboard interface in UMR-Writer 2.0, which is a powerful addition to the original mouse interface, supporting faster annotation for more experienced annotators. The new interface also addresses issues with the original mouse interface. Additionally, we demonstrate an efficient workflow for annotation project management in UMR-Writer 2.0, which has been applied to many projects.

Proceedings of the Fourth International Workshop on Designing Meaning Representations
Julia Bonn | Nianwen Xue

Rooted in AMR, Uniform Meaning Representation (UMR) is a graph-based formalism with nodes as concepts and edges as relations between them. When used to represent natural language semantics, UMR maps words in a sentence to concepts in the UMR graph. Multiword expressions (MWEs) pose a particular challenge to UMR annotation because they deviate from the default one-to-one mapping between words and concepts. There are different types of MWEs which require different kinds of annotation that must be specified in guidelines. This paper discusses the specific treatment for each type of MWE in UMR.

2022

Modal Dependency Parsing via Language Model Priming
Jiarui Yao | Nianwen Xue | Bonan Min
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The task of modal dependency parsing aims to parse a text into its modal dependency structure, which is a representation for the factuality of events in the text. We design a modal dependency parser that is based on priming pre-trained language models, and evaluate the parser on two data sets. Compared to baselines, we show an improvement of 2.6% in F-score for English and 4.6% for Chinese. To the best of our knowledge, this is also the first work on Chinese modal dependency parsing.

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop
Daphne Ippolito | Liunian Harold Li | Maria Leonor Pacheco | Danqi Chen | Nianwen Xue

Meaning Representations for Natural Languages: Design, Models and Applications
Jeffrey Flanigan | Ishan Jindal | Yunyao Li | Tim O’Gorman | Martha Palmer | Nianwen Xue
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

This tutorial reviews the design of common meaning representations, SoTA models for predicting meaning representations, and the applications of meaning representations in a wide range of downstream NLP tasks and real-world applications. Reporting by a diverse team of NLP researchers from academia and industry with extensive experience in designing, building and using meaning representations, our tutorial has three components: (1) an introduction to common meaning representations, including basic concepts and design challenges; (2) a review of SoTA methods on building models for meaning representations; and (3) an overview of applications of meaning representations in downstream NLP tasks and real-world applications. We will also present qualitative comparisons of common meaning representations and a quantitative study on how their differences impact model performance. Finally, we will share best practices in choosing the right meaning representation for downstream tasks.

2021

Factuality Assessment as Modal Dependency Parsing
Jiarui Yao | Haoling Qiu | Jin Zhao | Bonan Min | Nianwen Xue
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

As the sources of information that we consume every day rapidly diversify, it is becoming increasingly important to develop NLP tools that help to evaluate the credibility of the information we receive. A critical step towards this goal is to determine the factuality of events in text. In this paper, we frame factuality assessment as a modal dependency parsing task that identifies the events and their sources, formally known as conceivers, and then determines the level of certainty that the sources are asserting with respect to the events. We crowdsource the first large-scale data set annotated with modal dependency structures that consists of 353 Covid-19 related news articles, 24,016 events, and 2,938 conceivers. We also develop the first modal dependency parser that jointly extracts events and conceivers and constructs the modal dependency structure of a text. We evaluate the joint model against a pipeline model and demonstrate the advantage of the joint model in conceiver extraction and modal dependency structure construction when events and conceivers are automatically extracted. We believe the dataset and the models will be a valuable resource for a whole host of NLP applications such as fact checking and rumor detection.

A Joint Model for Dropped Pronoun Recovery and Conversational Discourse Parsing in Chinese Conversational Speech
Jingxuan Yang | Kerui Xu | Jun Xu | Si Li | Sheng Gao | Jun Guo | Nianwen Xue | Ji-Rong Wen
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this paper, we present a neural model for joint dropped pronoun recovery (DPR) and conversational discourse parsing (CDP) in Chinese conversational speech. We show that DPR and CDP are closely related, and a joint model benefits both tasks. We refer to our model as DiscProReco. It first encodes the tokens in each utterance in a conversation with a directed Graph Convolutional Network (GCN). The token states for an utterance are then aggregated to produce a single state for each utterance. The utterance states are then fed into a biaffine classifier to construct a conversational discourse graph. A second (multi-relational) GCN is then applied to the utterance states to produce a discourse relation-augmented representation for the utterances, which is then fused together with token states in each utterance as input to a dropped pronoun recovery layer. The joint model is trained and evaluated on a new Structure Parsing-enhanced Dropped Pronoun Recovery (SPDPR) data set that we annotated with both types of information. Experimental results on the SPDPR dataset and other benchmarks show that DiscProReco significantly outperforms the state-of-the-art baselines of both tasks.

Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop
Claire Bonial | Nianwen Xue

ExcavatorCovid: Extracting Events and Relations from Text Corpora for Temporal and Causal Analysis for COVID-19
Bonan Min | Benjamin Rozonoyer | Haoling Qiu | Alexander Zamanian | Nianwen Xue | Jessica MacBride
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Timely responses from policy makers to mitigate the impact of the COVID-19 pandemic rely on a comprehensive grasp of events, their causes, and their impacts. These events are reported at such a speed and scale as to be overwhelming. In this paper, we present ExcavatorCovid, a machine reading system that ingests open-source text documents (e.g., news and scientific publications), extracts COVID-19 related events and relations between them, and builds a Temporal and Causal Analysis Graph (TCAG). Excavator will help government agencies alleviate the information overload, understand likely downstream effects of political and economic decisions and events related to the pandemic, and respond in a timely manner to mitigate the impact of COVID-19. We expect the utility of Excavator to outlive the COVID-19 pandemic: analysts and decision makers will be empowered by Excavator to better understand and solve complex problems in the future. A demonstration video is available at https://vimeo.com/528619007.

UMR-Writer: A Web Application for Annotating Uniform Meaning Representations
Jin Zhao | Nianwen Xue | Jens Van Gysel | Jinho D. Choi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present UMR-Writer, a web-based application for annotating Uniform Meaning Representations (UMR), a graph-based, cross-linguistically applicable semantic representation developed recently to support the development of interpretable natural language applications that require deep semantic analysis of texts. We present the functionalities of UMR-Writer and discuss the challenges in developing such a tool and how they are addressed.

2020

Abstract Meaning Representation for MWE: A study of the mapping of aspectuality based on Mandarin light verb jiayi
Lu Lu | Nianwen Xue | Chu-Ren Huang
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

The 2020 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks and languages. Extending a similar setup from the previous year, five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the English training and evaluation data for the task, packaged in a uniform graph abstraction and serialization; for four of these representation frameworks, additional training and evaluation data was provided for one additional language per framework. The task received submissions from eight teams, of which two do not participate in the official ranking because they arrived after the closing deadline or made use of additional training data. All technical information regarding the task, including system submissions, official results, and links to supporting resources and software, is available from the task web site at: http://mrp.nlpl.eu

Pronouns are often dropped in Chinese conversations, and recovering the dropped pronouns is important for NLP applications such as Machine Translation. Existing approaches usually formulate this as a sequence labeling task of predicting whether there is a dropped pronoun before each token and its type. Each utterance is considered to be a sequence and labeled independently. Although these approaches have shown promise, labeling each utterance independently ignores the dependencies between pronouns in neighboring utterances. Modeling these dependencies is critical to improving the performance of dropped pronoun recovery. In this paper, we present a novel framework that combines the strength of the Transformer network with General Conditional Random Fields (GCRF) to model the dependencies between pronouns in neighboring utterances. Results on three Chinese conversation datasets show that the Transformer-GCRF model outperforms the state-of-the-art dropped pronoun recovery models. Exploratory analysis also demonstrates that the GCRF did help to capture the dependencies between pronouns in neighboring utterances, thus contributing to the performance improvements.

Annotating Temporal Dependency Graphs via Crowdsourcing
Jiarui Yao | Haoling Qiu | Bonan Min | Nianwen Xue
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present the construction of a corpus of 500 Wikinews articles annotated with temporal dependency graphs (TDGs) that can be used to train systems to understand temporal relations in text. We argue that temporal dependency graphs, built on previous research on narrative times and temporal anaphora, provide a representation scheme that achieves a good trade-off between completeness and practicality in temporal annotation. We also provide a crowdsourcing strategy to annotate TDGs, and demonstrate the feasibility of this approach with an evaluation of the quality of the annotation, and the utility of the resulting data set by training a machine learning model on this data set. The data set is publicly available.

2019

Acquiring Structured Temporal Representation via Crowdsourcing: A Feasibility Study
Yuchen Zhang | Nianwen Xue
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Temporal Dependency Trees are a structured temporal representation that represents temporal relations among time expressions and events in a text as a dependency tree structure. Compared to traditional pair-wise temporal relation representations, temporal dependency trees facilitate efficient annotations, higher inter-annotator agreement, and efficient computations. However, annotations on temporal dependency trees so far have only been done by expert annotators, which is costly and time-consuming. In this paper, we introduce a method to crowdsource temporal dependency tree annotations, and show that this representation is intuitive and can be collected with high accuracy and agreement through crowdsourcing. We produce a corpus of temporal dependency trees, and present a baseline temporal dependency parser, trained and evaluated on this new corpus.

Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning
Stephan Oepen | Omri Abend | Jan Hajic | Daniel Hershcovich | Marco Kuhlmann | Tim O’Gorman | Nianwen Xue

The 2019 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks. Five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the training and evaluation data for the task, packaged in a uniform abstract graph representation and serialization. The task received submissions from eighteen teams, of which five do not participate in the official ranking because they arrived after the closing deadline, made use of additional training data, or involved one of the task co-organizers. All technical information regarding the task, including system submissions, official results, and links to supporting resources and software, is available from the task web site at: http://mrp.nlpl.eu

Building a Chinese AMR Bank with Concept and Relation Alignments
Bin Li | Yuan Wen | Li Song | Weiguang Qu | Nianwen Xue
Linguistic Issues in Language Technology, Volume 18, 2019 - Exploiting Parsed Corpora: Applications in Research, Pedagogy, and Processing

Abstract Meaning Representation (AMR) is a meaning representation framework in which the meaning of a full sentence is represented as a single-rooted, acyclic, directed graph. In this article, we describe an on-going project to build a Chinese AMR (CAMR) corpus, which currently includes 10,149 sentences from the newsgroup and weblog portion of the Chinese TreeBank (CTB). We describe the annotation specifications for the CAMR corpus, which follow the annotation principles of English AMR but make adaptations where needed to accommodate the linguistic facts of Chinese. The CAMR specifications also include a systematic treatment of sentence-internal discourse relations. One significant change we have made to the AMR annotation methodology is the inclusion of the alignment between word tokens in the sentence and the concepts/relations in the CAMR annotation to make it easier for automatic parsers to model the correspondence between a sentence and its meaning representation. We develop an annotation tool for CAMR, and the inter-annotator agreement as measured by the Smatch score between the two annotators is 0.83, indicating reliable annotation. We also present some quantitative analysis of the CAMR corpus. 46.71% of the AMRs of the sentences are non-tree graphs. Moreover, the AMRs of 88.95% of the sentences have concepts that are inferred from the context of the sentence but do not correspond to a specific word.

Recovering dropped pronouns in Chinese conversations via modeling their referents
Jingxuan Yang | Jianzhuo Tong | Si Li | Sheng Gao | Jun Guo | Nianwen Xue
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Pronouns are often dropped in Chinese sentences, and this happens more frequently in conversational genres as their referents can be easily understood from context. Recovering dropped pronouns is essential to applications such as Information Extraction where the referents of these dropped pronouns need to be resolved, or Machine Translation when Chinese is the source language. In this work, we present a novel end-to-end neural network model to recover dropped pronouns in conversational data. Our model is based on a structured attention mechanism that models the referents of dropped pronouns utilizing both sentence-level and word-level information. Results on three different conversational genres show that our approach achieves a significant improvement over the current state of the art.

Modeling Quantification and Scope in Abstract Meaning Representations
James Pustejovsky | Ken Lai | Nianwen Xue
Proceedings of the First International Workshop on Designing Meaning Representations

In this paper, we propose an extension to Abstract Meaning Representations (AMRs) to encode scope information of quantifiers and negation, in a way that overcomes the semantic gaps of the schema while maintaining its cognitive simplicity. Specifically, we address three phenomena not previously part of the AMR specification: quantification, negation (generally), and modality. The resulting representation, which we call “Uniform Meaning Representation” (UMR), adopts the predicative core of AMR and embeds it under a “scope” graph when appropriate. UMR representations differ from other treatments of quantification and modal scope phenomena in two ways: (a) they are more transparent; and (b) they specify default scope when possible.

Parsing Meaning Representations: Is Easier Always Better?
Zi Lin | Nianwen Xue
Proceedings of the First International Workshop on Designing Meaning Representations

Parsing accuracy varies a great deal across meaning representations. In this paper, we compare the parsing performance of Abstract Meaning Representation (AMR) and Minimal Recursion Semantics (MRS), and provide an in-depth analysis of the factors that contribute to the discrepancy in their parsing accuracy. By crystallizing the trade-off between representation expressiveness and ease of automatic parsing, we hope our results can help inform the design of the next-generation meaning representations.

2018

Transition-Based Chinese AMR Parsing
Chuan Wang | Bin Li | Nianwen Xue
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

This paper presents the first AMR parser built on the Chinese AMR bank. By applying a transition-based AMR parsing framework to Chinese, we first investigate how well the transitions originally designed for English AMR parsing generalize to Chinese and provide a comparative analysis between the transitions for English and Chinese. We then perform a detailed error analysis to identify the major challenges in Chinese AMR parsing that we hope will inform future research in this area.

Structured Interpretation of Temporal Relations
Yuchen Zhang | Nianwen Xue
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Neural Ranking Models for Temporal Dependency Structure Parsing
Yuchen Zhang | Nianwen Xue
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We design and build the first neural temporal dependency parser. It utilizes a neural ranking model with minimal feature engineering, and parses time expressions and events in a text into a temporal dependency tree structure. We evaluate our parser on two domains: news reports and narrative stories. In a parsing-only evaluation setup where gold time expressions and events are provided, our parser reaches 0.81 and 0.70 f-score on unlabeled and labeled parsing respectively, a result that is very competitive against alternative approaches. In an end-to-end evaluation setup where time expressions and events are automatically recognized, our parser beats two strong baselines on both data domains. Our experimental results and discussions shed light on the nature of temporal dependency structures in different domains and provide insights that we believe will be valuable to future research in this area.

2017

pdf bib abs
Translation Divergences in Chinese–English Machine Translation: An Empirical Investigation
Dun Deng | Nianwen Xue
Computational Linguistics, Volume 43, Issue 3 - September 2017

In this article, we conduct an empirical investigation of translation divergences between Chinese and English relying on a parallel treebank. To do this, we first devise a hierarchical alignment scheme where Chinese and English parse trees are aligned in a way that eliminates conflicts and redundancies between word alignments and syntactic parses to prevent the generation of spurious translation divergences. Using this Hierarchically Aligned Chinese–English Parallel Treebank (HACEPT), we are able to semi-automatically identify and categorize the translation divergences between the two languages and quantify each type of translation divergence. Our results show that the translation divergences are much broader than described in previous studies that are largely based on anecdotal evidence and linguistic knowledge. The distribution of the translation divergences also shows that some high-profile translation divergences that motivate previous research are actually very rare in our data, whereas other translation divergences that have previously received little attention actually exist in large quantities. We also show that HACEPT allows the extraction of syntax-based translation rules, most of which are expressive enough to capture the translation divergences, and point out that the syntactic annotation in existing treebanks is not optimal for extracting such translation rules. We also discuss the implications of our study for attempts to bridge translation divergences by devising shared semantic representations across languages. Our quantitative results lend further support to the observation that although it is possible to bridge some translation divergences with semantic representations, other translation divergences are open-ended, thus building a semantic representation that captures all possible translation divergences may be impractical.

pdf bib
Proceedings of the IJCNLP 2017, Shared Tasks
Chao-Hong Liu | Preslav Nakov | Nianwen Xue
Proceedings of the IJCNLP 2017, Shared Tasks

pdf abs
Getting the Most out of AMR Parsing
Chuan Wang | Nianwen Xue
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

This paper proposes to tackle the AMR parsing bottleneck by improving two components of an AMR parser: concept identification and alignment. We first build a Bidirectional LSTM based concept identifier that is able to incorporate richer contextual information to learn sparse AMR concept labels. We then extend an HMM-based word-to-concept alignment model with graph distance distortion and a rescoring method during decoding to incorporate the structural information in the AMR graph. We show that integrating the two components into an existing AMR parser results in consistently better performance over the state of the art on various datasets.

pdf abs
A Systematic Study of Neural Discourse Models for Implicit Discourse Relation
Attapol Rutherford | Vera Demberg | Nianwen Xue
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Inferring implicit discourse relations in natural language text is the most difficult subtask in discourse parsing. Many neural network models have been proposed to tackle this problem. However, the comparison for this task is not unified, so we could hardly draw clear conclusions about the effectiveness of various architectures. Here, we propose neural network models that are based on feedforward and long short-term memory (LSTM) architectures and systematically study the effects of varying structures. To our surprise, the best-configured feedforward architecture outperforms the LSTM-based model in most cases despite thorough tuning. Further, we compare our best feedforward system with competitive convolutional and recurrent networks and find that feedforward can actually be more effective. For the first time for this task, we compile and publish outputs from previous neural and non-neural systems to establish the standard for further comparison.

pdf abs
Addressing the Data Sparsity Issue in Neural AMR Parsing
Xiaochang Peng | Chuan Wang | Daniel Gildea | Nianwen Xue
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Neural attention models have achieved great success in different NLP tasks. However, they have not fulfilled their promise on the AMR parsing task due to the data sparsity issue. In this paper, we describe a sequence-to-sequence model for AMR parsing and present different ways to tackle the data sparsity problem. We show that our methods achieve significant improvement over a baseline neural attention model and our results are also competitive against state-of-the-art systems that do not use extra linguistic resources.

pdf bib
Proceedings of the 11th Linguistic Annotation Workshop
Nathan Schneider | Nianwen Xue
Proceedings of the 11th Linguistic Annotation Workshop

pdf
Discourse Segmentation for Building a RST Chinese Treebank
Shuyuan Cao | Nianwen Xue | Iria da Cunha | Mikel Iruskieta | Chuan Wang
Proceedings of the 6th Workshop on Recent Advances in RST and Related Formalisms

2016

pdf
CAMR at SemEval-2016 Task 8: An Extended Transition-based AMR Parser
Chuan Wang | Sameer Pradhan | Xiaoman Pan | Heng Ji | Nianwen Xue
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
Annotating the Little Prince with Chinese AMRs
Bin Li | Yuan Wen | Weiguang Qu | Lijun Bu | Nianwen Xue
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

pdf
Converting SynTagRus Dependency Treebank into Penn Treebank Style
Alex Luu | Sophia A. Malamud | Nianwen Xue
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

pdf
Annotating the discourse and dialogue structure of SMS message conversations
Nianwen Xue | Qishen Su | Sooyoung Jeong
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

High accuracy for automated translation and information retrieval calls for linguistic annotations at various language levels. The plethora of informal internet content sparked the demand for porting state-of-the-art natural language processing (NLP) applications to new social media as well as diverse language adaptation. An effort launched by the BOLT (Broad Operational Language Translation) program at DARPA (Defense Advanced Research Projects Agency) successfully addressed the internet information with enhanced NLP systems. BOLT aims for automated translation and linguistic analysis for informal genres of text and speech in online and in-person communication. As a part of this program, the Linguistic Data Consortium (LDC) developed valuable linguistic resources in support of the training and evaluation of such new technologies. This paper focuses on methodologies, infrastructure, and procedure for developing linguistic annotation at various language levels, including Treebank (TB), word alignment (WA), PropBank (PB), and co-reference (CoRef). Inspired by the OntoNotes approach with adaptations to the tasks to reflect the goals and scope of the BOLT project, this effort has introduced more annotation types of informal and free-style genres in English, Chinese and Egyptian Arabic. The corpus produced is by far the largest multi-lingual, multi-level and multi-genre annotation corpus of informal text and speech.

pdf bib
Proceedings of the CoNLL-16 shared task
Nianwen Xue
Proceedings of the CoNLL-16 shared task

pdf
Robust Non-Explicit Neural Discourse Parser in English and Chinese
Attapol Rutherford | Nianwen Xue
Proceedings of the CoNLL-16 shared task

2015

pdf bib
The CoNLL-2015 Shared Task on Shallow Discourse Parsing
Nianwen Xue | Hwee Tou Ng | Sameer Pradhan | Rashmi Prasad | Christopher Bryant | Attapol Rutherford
Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task

pdf
A Transition-based Algorithm for AMR Parsing
Chuan Wang | Nianwen Xue | Sameer Pradhan
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Improving the Inference of Implicit Discourse Relations via Classifying Explicit Discourse Connectives
Attapol Rutherford | Nianwen Xue
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Harmonizing word alignments and syntactic structures for extracting phrasal translation equivalents
Dun Deng | Nianwen Xue | Shiman Guo
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf
Feature Optimization for Constituent Parsing via Neural Networks
Zhiguo Wang | Haitao Mi | Nianwen Xue
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf
Recovering dropped pronouns from Chinese text messages
Yaqin Yang | Yalin Liu | Nianwen Xue
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf
Boosting Transition-based AMR Parsing with Refined Actions and Auxiliary Analyzers
Chuan Wang | Nianwen Xue | Sameer Pradhan
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

pdf abs
Buy one get one free: Distant annotation of Chinese tense, event type and modality
Nianwen Xue | Yuchen Zhang
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We describe a “distant annotation” method where we mark up the semantic tense, event type, and modality of Chinese events via a word-aligned parallel corpus. We first map Chinese verbs to their English counterparts via word alignment, and then annotate the resulting English text spans with coarse-grained categories for semantic tense, event type, and modality that we believe apply to both English and Chinese. Because English has richer morpho-syntactic indicators for semantic tense, event type and modality than Chinese, our intuition is that this distant annotation approach will yield more consistent annotation than if we annotate the Chinese side directly. We report experimental results that show stable annotation agreement statistics and that event type and modality have significant influence on tense prediction. We also report the size of the annotated corpus that we have obtained, and how different domains impact annotation consistency.

pdf abs
Not an Interlingua, But Close: Comparison of English AMRs to Chinese and Czech
Nianwen Xue | Ondřej Bojar | Jan Hajič | Martha Palmer | Zdeňka Urešová | Xiuhong Zhang
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Abstract Meaning Representations (AMRs) are rooted, directional and labeled graphs that abstract away from morpho-syntactic idiosyncrasies such as word category (verbs and nouns), word order, and function words (determiners, some prepositions). Because these syntactic idiosyncrasies account for many of the cross-lingual differences, it would be interesting to see if this representation can serve, e.g., as a useful, minimally divergent transfer layer in machine translation. To answer this question, we have translated 100 English sentences that have existing AMRs into Chinese and Czech to create AMRs for them. A cross-linguistic comparison of English to Chinese and Czech AMRs reveals both cases where the AMRs for the language pairs align well structurally and cases of linguistic divergence. We found that the level of compatibility of AMR between English and Chinese is higher than between English and Czech. We believe this kind of comparison is beneficial to further refining the annotation standards for each of the three languages and will lead to more compatible annotation guidelines between the languages.

pdf
Aligning Chinese-English Parallel Parse Trees: Is it Feasible?
Dun Deng | Nianwen Xue
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

pdf
Automatic Inference of the Tense of Chinese Events Using Implicit Linguistic Information
Yuchen Zhang | Nianwen Xue
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf
Joint POS Tagging and Transition-based Constituent Parsing in Chinese with Non-local Features
Zhiguo Wang | Nianwen Xue
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Effective Document-Level Features for Chinese Patent Word Segmentation
Si Li | Nianwen Xue
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
Building a Hierarchically Aligned Chinese-English Parallel Treebank
Dun Deng | Nianwen Xue
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf
Discovering Implicit Discourse Relations Through Brown Cluster Pair Representation and Coreference Patterns
Attapol Rutherford | Nianwen Xue
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

pdf
A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
Zhiguo Wang | Chengqing Zong | Nianwen Xue
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
Dependency-based empty category detection via phrase structure trees
Nianwen Xue | Yaqin Yang
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Distant annotation of Chinese tense and modality
Nianwen Xue | Yuchen Zhang | Yaqin Yang
Proceedings of the IWCS 2013 Workshop on Annotation of Modal Meanings in Natural Language (WAMM)

2012

pdf
PDTB-style Discourse Annotation of Chinese Text
Yuping Zhou | Nianwen Xue
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Chinese Comma Disambiguation for Discourse Analysis
Yaqin Yang | Nianwen Xue
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Exploring Temporal Vagueness with Mechanical Turk
Yuping Zhou | Nianwen Xue
Proceedings of the Sixth Linguistic Annotation Workshop

pdf bib
Joint Conference on EMNLP and CoNLL - Shared Task
Sameer Pradhan | Alessandro Moschitti | Nianwen Xue
Joint Conference on EMNLP and CoNLL - Shared Task

pdf bib
CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes
Sameer Pradhan | Alessandro Moschitti | Nianwen Xue | Olga Uryupina | Yuchen Zhang
Joint Conference on EMNLP and CoNLL - Shared Task

pdf
Building a Chinese Lexical Taxonomy
Xiaopeng Bai | Nianwen Xue
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf
Extending and Scaling up the Chinese Treebank Annotation
Xiuhong Zhang | Nianwen Xue
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

Parallel aligned treebanks (PAT) are linguistic corpora annotated with morphological and syntactic structures that are aligned at sentence as well as sub-sentence levels. They are valuable resources for improving machine translation (MT) quality. Recently, there has been an increasing demand for such data, especially for divergent language pairs. The Linguistic Data Consortium (LDC) and its academic partners have been developing Arabic-English and Chinese-English PATs for several years. This paper describes the PAT corpus creation effort for the program GALE (Global Autonomous Language Exploitation) and introduces the potential issues of scaling up this PAT effort for the program BOLT (Broad Operational Language Translation). Based on existing infrastructures and in the light of current annotation process, challenges and approaches, we are exploring new methodologies to address emerging challenges in constructing PATs, including data volume bottlenecks, dialect issues of Arabic languages, and new genre features related to rapidly changing social media. Preliminary experimental results are presented to show the feasibility of the approaches proposed.

pdf abs
Annotating dropped pronouns in Chinese newswire text
Elizabeth Baran | Yaqin Yang | Nianwen Xue
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We propose an annotation framework to explicitly identify dropped subject pronouns in Chinese. We acknowledge and specify 10 concrete pronouns that exist as words in Chinese and 4 abstract pronouns that do not correspond to Chinese words, but that are recognized conceptually by native Chinese speakers. These abstract pronouns are identified as “unspecified”, “pleonastic”, “event”, and “existential” and are argued to exist cross-linguistically. We trained two annotators, fluent in Chinese, and adjudicated their annotations to form a gold standard. We achieved an inter-annotator agreement kappa of .6 and an observed agreement of .7. We found that annotators had the most difficulty with the abstract pronouns, such as “unspecified” and “event”, but we posit that further specification and training have the potential to significantly improve these results. We believe that this annotated data will serve to help improve Machine Translation models that translate from Chinese to a non pro-drop language, like English, that requires all subject pronouns to be explicit.

In the context of Natural Language Processing, annotation is about recovering implicit information that is useful for natural language applications. In this paper we describe a tense annotation task for Chinese - a language that does not have grammatical tense - that is designed to infer the temporal location of a situation in relation to the temporal deixis, the moment of speech. If successful, this would be a highly rewarding endeavor as it has applications in many natural language systems. Our preliminary experiments show that this is a very challenging annotation task for which high annotation consistency is very difficult but not impossible to achieve. We show that guidelines that provide a conceptually intuitive framework will be crucial to the success of this annotation effort.

pdf
Automatic Inference of the Temporal Location of Situations in Chinese Text
Nianwen Xue
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

pdf
Labeling Chinese Predicates with Semantic Roles
Nianwen Xue
Computational Linguistics, Volume 34, Number 2, June 2008 - Special Issue on Semantic Role Labeling

2006

pdf abs
Annotating the Predicate-Argument Structure of Chinese Nominalizations
Nianwen Xue
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes the Chinese NomBank Project, the goal of which is to annotate the predicate-argument structure of nominalized predicates in Chinese. The Chinese Nombank extends the general framework of the English and Chinese Proposition Banks to the annotation of nominalized predicates and adds a layer of semantic annotation to the Chinese Treebank. We first outline the scope of the work by discussing the markability of the nominalized predicates and their arguments. We then attempt to provide a categorization of the distribution of the arguments of nominalized predicates. We also discuss the relevance of the event/result distinction to the annotation of nominalized predicates and the phenomenon of incorporation. Finally we discuss some cross-linguistic differences between English and Chinese.

pdf
Aligning Features with Sense Distinction Dimensions
Nianwen Xue | Jinying Chen | Martha Palmer
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf
Semantic role labeling of nominalized predicates in Chinese
Nianwen Xue
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

2005

pdf
A Parallel Proposition Bank II for Chinese and English
Martha Palmer | Nianwen Xue | Olga Babko-Malaya | Jinying Chen | Benjamin Snyder
Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky

pdf
Annotating Discourse Connectives in the Chinese Treebank
Nianwen Xue
Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky

2004

pdf
Proposition Bank II: Delving Deeper
Olga Babko-Malaya | Martha Palmer | Nianwen Xue | Aravind Joshi | Seth Kulick
Proceedings of the Workshop Frontiers in Corpus Annotation at HLT-NAACL 2004

pdf
Calibrating Features for Semantic Role Labeling
Nianwen Xue | Martha Palmer
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

2003

pdf bib
Chinese Word Segmentation as Character Tagging
Nianwen Xue
International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing

pdf
Annotating the Propositions in the Penn Chinese Treebank
Nianwen Xue | Martha Palmer
Proceedings of the Second SIGHAN Workshop on Chinese Language Processing

pdf
Chinese Word Segmentation as LMR Tagging
Nianwen Xue | Libin Shen
Proceedings of the Second SIGHAN Workshop on Chinese Language Processing

pdf abs
Automatic predicate argument structure analysis of the Penn Chinese Treebank
Nianwen Xue | Seth Kulick
Proceedings of Machine Translation Summit IX: Papers

Recent work in machine translation and information extraction has demonstrated the utility of a level that represents the predicate-argument structure. It would be especially useful for machine translation to have two such Proposition Banks, one for each language under consideration. A Proposition Bank for English has been developed over the last few years, and we describe here our development of a tool for facilitating the development of a Chinese Proposition Bank. We also discuss some issues specific to the Chinese Treebank that complicate the matter of mapping syntactic representation to a predicate-argument level, and report on some preliminary evaluation of the accuracy of the semantic tagging tool.