Jingcheng Niu


Using Roark-Hollingshead Distance to Probe BERT’s Syntactic Competence
Jingcheng Niu | Wenjie Lu | Eric Corlett | Gerald Penn
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Probing BERT’s general ability to reason about syntax is no simple endeavour, primarily because of the uncertainty surrounding how large language models represent syntactic structure. Many prior accounts of BERT’s agility as a syntactic tool (Clark et al., 2013; Lau et al., 2014; Marvin and Linzen, 2018; Chowdhury and Zamparelli, 2018; Warstadt et al., 2019, 2020; Hu et al., 2020) have therefore confined themselves to studying very specific linguistic phenomena, and there has still been no definitive answer as to whether BERT “knows” syntax.The advent of perturbed masking (Wu et al., 2020) would then seem to be significant, because this is a parameter-free probing method that directly samples syntactic trees from BERT’s embeddings. These sampled trees outperform a right-branching baseline, thus providing preliminary evidence that BERT’s syntactic competence bests a simple baseline. This baseline is underwhelming, however, and our reappraisal below suggests that this result, too, is inconclusive.We propose RH Probe, an encoder-decoder probing architecture that operates on two probing tasks. We find strong empirical evidence confirming the existence of important syntactic information in BERT, but this information alone appears not to be enough to reproduce syntax in its entirety. Our probe makes crucial use of a conjecture made by Roark and Holling-shead (2008) that a particular lexical annotation that we shall call RH distance is a sufficient encoding of unlabelled binary syntactic trees, and we prove this conjecture.

Does BERT Rediscover a Classical NLP Pipeline?
Jingcheng Niu | Wenjie Lu | Gerald Penn
Proceedings of the 29th International Conference on Computational Linguistics

Does BERT store surface knowledge in its bottom layers, syntactic knowledge in its middle layers, and semantic knowledge in its upper layers? In re-examining Jawahar et al. (2019) and Tenney et al.’s (2019a) probes into the structure of BERT, we have found that the pipeline-like separation that they asserted lacks conclusive empirical support. BERT’s structure is, however, linguistically founded, although perhaps in a way that is more nuanced than can be explained by layers alone. We introduce a novel probe, called GridLoc, through which we can also take into account token positions, training rounds, and random seeds. Using GridLoc, we are able to detect other, stronger regularities that suggest that pseudo-cognitive appeals to layer depth may not be the preferable mode of explanation for BERT’s inner workings.

Bringing the State-of-the-Art to Customers: A Neural Agent Assistant Framework for Customer Service Support
Stephen Obadinma | Faiza Khan Khattak | Shirley Wang | Tania Sidhorn | Elaine Lau | Sean Robertson | Jingcheng Niu | Winnie Au | Alif Munim | Karthik Raja Kalaiselvi Bhaskar
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Building Agent Assistants that can help improve customer service support requires inputs from industry users and their customers, as well as knowledge about state-of-the-art Natural Language Processing (NLP) technology. We combine expertise from academia and industry to bridge the gap and build task/domain-specific Neural Agent Assistants (NAA) with three high-level components for: (1) Intent Identification, (2) Context Retrieval, and (3) Response Generation. In this paper, we outline the pipeline of the NAA’s core system and also present three case studies in which three industry partners successfully adapt the framework to find solutions to their unique challenges. Our findings suggest that a collaborative process is instrumental in spurring the development of emerging NLP models for Conversational AI tasks in industry. The full reference implementation code and results are available at https://github.com/VectorInstitute/NAA.


pdf bib
Statistically Evaluating Social Media Sentiment Trends towards COVID-19 Non-Pharmaceutical Interventions with Event Studies
Jingcheng Niu | Erin Rees | Victoria Ng | Gerald Penn
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

In the midst of a global pandemic, understanding the public’s opinion of their government’s policy-level, non-pharmaceutical interventions (NPIs) is a crucial component of the health-policy-making process. Prior work on CoViD-19 NPI sentiment analysis by the epidemiological community has proceeded without a method for properly attributing sentiment changes to events, an ability to distinguish the influence of various events across time, a coherent model for predicting the public’s opinion of future events of the same sort, nor even a means of conducting significance tests. We argue here that this urgently needed evaluation method does already exist. In the financial sector, event studies of the fluctuations in a publicly traded company’s stock price are commonplace for determining the effects of earnings announcements, product placements, etc. The same method is suitable for analysing temporal sentiment variation in the light of policy-level NPIs. We provide a case study of Twitter sentiment towards policy-level NPIs in Canada. Our results confirm a generally positive connection between the announcements of NPIs and Twitter sentiment, and we document a promising correlation between the results of this study and a public-health survey of popular compliance with NPIs.


Grammaticality and Language Modelling
Jingcheng Niu | Gerald Penn
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

Ever since Pereira (2000) provided evidence against Chomsky’s (1957) conjecture that statistical language modelling is incommensurable with the aims of grammaticality prediction as a research enterprise, a new area of research has emerged that regards statistical language models as “psycholinguistic subjects” and probes their ability to acquire syntactic knowledge. The advent of The Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019) has earned a spot on the leaderboard for acceptability judgements, and the polemic between Lau et al. (2017) and Sprouse et al. (2018) has raised fundamental questions about the nature of grammaticality and how acceptability judgements should be elicited. All the while, we are told that neural language models continue to improve. That is not an easy claim to test at present, however, because there is almost no agreement on how to measure their improvement when it comes to grammaticality and acceptability judgements. The GLUE leaderboard bundles CoLA together with a Matthews correlation coefficient (MCC), although probably because CoLA’s seminal publication was using it to compute inter-rater reliabilities. Researchers working in this area have used other accuracy and correlation scores, often driven by a need to reconcile and compare various discrete and continuous variables with each other. The score that we will advocate for in this paper, the point biserial correlation, in fact compares a discrete variable (for us, acceptability judgements) to a continuous variable (for us, neural language model probabilities). The only previous work in this area to choose the PBC that we are aware of is Sprouse et al. (2018a), and that paper actually applied it backwards (with some justification) so that the language model probability was treated as the discrete binary variable by setting a threshold. With the PBC in mind, we will first reappraise some recent work in syntactically targeted linguistic evaluations (Hu et al., 2020), arguing that while their experimental design sets a new high watermark for this topic, their results may not prove what they have claimed. We then turn to the task-independent assessment of language models as grammaticality classifiers. Prior to the introduction of the GLUE leaderboard, the vast majority of this assessment was essentially anecdotal, and we find the use of the MCC in this regard to be problematic. We conduct several studies with PBCs to compare several popular language models. We also study the effects of several variables such as normalization and data homogeneity on PBC.

Temporal Histories of Epidemic Events (THEE): A Case Study in Temporal Annotation for Public Health
Jingcheng Niu | Victoria Ng | Gerald Penn | Erin E. Rees
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a new temporal annotation standard, THEE-TimeML, and a corpus TheeBank enabling precise temporal information extraction (TIE) for event-based surveillance (EBS) systems in the public health domain. Current EBS must estimate the occurrence time of each event based on coarse document metadata such as document publication time. Because of the complicated language and narration style of news articles, estimated case outbreak times are often inaccurate or even erroneous. Thus, it is necessary to create annotation standards and corpora to facilitate the development of TIE systems in the public health domain to address this problem.We will discuss the adaptations that have proved necessary for this domain as we present THEE-TimeML and TheeBank. Finally, we document the corpus annotation process, and demonstrate the immediate benefit to public health applications brought by the annotations.


Rationally Reappraising ATIS-based Dialogue Systems
Jingcheng Niu | Gerald Penn
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The Air Travel Information Service (ATIS) corpus has been the most common benchmark for evaluating Spoken Language Understanding (SLU) tasks for more than three decades since it was released. Recent state-of-the-art neural models have obtained F1-scores near 98% on the task of slot filling. We developed a rule-based grammar for the ATIS domain that achieves a 95.82% F1-score on our evaluation set. In the process, we furthermore discovered numerous shortcomings in the ATIS corpus annotation, which we have fixed. This paper presents a detailed account of these shortcomings, our proposed repairs, our rule-based grammar and the neural slot-filling architectures associated with ATIS. We also rationally reappraise the motivations for choosing a neural architecture in view of this account. Fixing the annotation errors results in a relative error reduction of between 19.4 and 52% across all architectures. We nevertheless argue that neural models must play a different role in ATIS dialogues because of the latter’s lack of variety.