Chenyang Lyu


DCU-ML at the FinNLP-2022 ERAI Task: Investigating the Transferability of Sentiment Analysis Data for Evaluating Rationales of Investors
Chenyang Lyu | Tianbo Ji | Liting Zhou
Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP)

In this paper, we describe our system for the FinNLP-2022 shared task: Evaluating the Rationales of Amateur Investors (ERAI). The ERAI shared task focuses on mining profitable information from financial texts by predicting the possible Maximal Potential Profit (MPP) and Maximal Loss (ML) based on posts from amateur investors. There are two sub-tasks in ERAI, Pairwise Comparison and Unsupervised Rank, both of which target the prediction of MPP and ML. To tackle the two sub-tasks, we frame the problem as a text-pair classification task in which the input consists of two documents and the output is a label indicating whether the first document will lead to higher MPP or lower ML. Specifically, we propose to take advantage of the transferability of Sentiment Analysis data, under the assumption that a more positive text will lead to higher MPP or higher ML, to facilitate the prediction of MPP and ML. In experiments on the ERAI blind test set, our systems trained on Sentiment Analysis data and ERAI training data ranked 1st and 8th in ML and MPP pairwise comparison respectively. Code available in this link.
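The text-pair framing described in this abstract can be illustrated with a minimal sketch. It assumes each post comes with a numeric MPP value and that a pair is labeled 1 when the first post has the higher MPP; the function name `make_pairs`, the tie handling, and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def make_pairs(posts):
    """Build text-pair classification examples from (text, mpp) tuples.

    The label is 1 when the first post has the higher MPP, else 0.
    Ties are skipped here (an assumption) because they carry no ordering signal.
    """
    pairs = []
    for (text_a, mpp_a), (text_b, mpp_b) in permutations(posts, 2):
        if mpp_a == mpp_b:
            continue
        pairs.append((text_a, text_b, int(mpp_a > mpp_b)))
    return pairs


# Hypothetical toy data: the texts and MPP values are invented for illustration.
posts = [("post A", 0.12), ("post B", 0.05), ("post C", 0.12)]
pairs = make_pairs(posts)
```

Each resulting triple could then be fed to any text-pair classifier; the same construction would apply to ML with the comparison direction chosen to match the task definition.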

Extending the Scope of Out-of-Domain: Examining QA models in multiple subdomains
Chenyang Lyu | Jennifer Foster | Yvette Graham
Proceedings of the Third Workshop on Insights from Negative Results in NLP

Past work that investigates out-of-domain performance of QA systems has mainly focused on general domains (e.g. news domain, Wikipedia domain), underestimating the importance of subdomains defined by the internal characteristics of QA datasets. In this paper, we extend the scope of “out-of-domain” by splitting QA examples into different subdomains according to their internal characteristics, including question type, text length and answer position. We then examine the performance of QA systems trained on data from different subdomains. Experimental results show that the performance of QA systems can be significantly reduced when the training data and test data come from different subdomains. These results question the generalizability of current QA systems across multiple subdomains, suggesting the need to combat the bias introduced by the internal characteristics of QA datasets.
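As a rough illustration of the subdomain split described in this abstract, the sketch below groups QA examples by question type and by relative answer position. The field names (`question`, `context`, `answer_start`), the question-word list, and the three-bucket position scheme are assumptions made for the example, not the paper's actual definitions.

```python
from collections import defaultdict

# Assumed set of question words; anything else falls into "other".
QUESTION_WORDS = ("what", "who", "when", "where", "why", "how", "which")


def question_type(question):
    """Return the leading question word, or 'other' if none matches."""
    tokens = question.strip().lower().split()
    first = tokens[0] if tokens else ""
    return first if first in QUESTION_WORDS else "other"


def answer_position_bucket(context, answer_start, n_buckets=3):
    """Bucket the answer by its relative character offset in the context."""
    if not context:
        return 0
    relative = answer_start / len(context)
    return min(int(relative * n_buckets), n_buckets - 1)


def split_by(examples, key_fn):
    """Group examples into subdomains using an arbitrary key function."""
    groups = defaultdict(list)
    for example in examples:
        groups[key_fn(example)].append(example)
    return dict(groups)


# Hypothetical examples in a SQuAD-like layout.
examples = [
    {"question": "Who wrote it?", "context": "a" * 90, "answer_start": 0},
    {"question": "Name the author.", "context": "a" * 90, "answer_start": 45},
    {"question": "who said so?", "context": "a" * 90, "answer_start": 89},
]
by_type = split_by(examples, lambda ex: question_type(ex["question"]))
by_position = split_by(
    examples, lambda ex: answer_position_bucket(ex["context"], ex["answer_start"])
)
```

A model could then be trained on one group and evaluated on another to measure the cross-subdomain gap.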

Achieving Reliable Human Assessment of Open-Domain Dialogue Systems
Tianbo Ji | Yvette Graham | Gareth Jones | Chenyang Lyu | Qun Liu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Evaluation of open-domain dialogue systems is highly challenging and development of better techniques is highlighted time and again as desperately needed. Despite substantial efforts to carry out reliable live evaluation of systems in recent competitions, annotations have been abandoned and reported as too unreliable to yield sensible results. This is a serious problem since automatic metrics are not known to provide a good indication of what may or may not be a high-quality conversation. Answering the distress call of competitions that have emphasized the urgent need for better evaluation techniques in dialogue, we present the successful development of human evaluation that is highly reliable while still remaining feasible and low cost. Self-replication experiments reveal almost perfectly repeatable results with a correlation of r=0.969. Furthermore, due to the lack of appropriate methods of statistical significance testing, the likelihood of potential improvements to systems occurring due to chance is rarely taken into account in dialogue evaluation, and the evaluation we propose facilitates application of standard tests. Since we have developed a highly reliable evaluation method, new insights into system performance can be revealed. We therefore include a comparison of state-of-the-art models (i) with and without personas, to measure the contribution of personas to conversation quality, as well as (ii) prescribed versus freely chosen topics. Interestingly with respect to personas, results indicate that personas do not positively contribute to conversation quality as expected.
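For readers who want to run the same kind of self-replication check, the sketch below computes a correlation between per-system scores from two independent evaluation runs. It assumes the reported r is a Pearson correlation coefficient; the function name and the scores are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-system scores from an original run and its replication.
run_1 = [52.0, 47.5, 61.2, 38.9]
run_2 = [50.8, 48.1, 60.0, 40.2]
r = pearson_r(run_1, run_2)
```

A value of r close to 1 would indicate that the two runs rank and score the systems almost identically.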

DCU-Lorcan at FinCausal 2022: Span-based Causality Extraction from Financial Documents using Pre-trained Language Models
Chenyang Lyu | Tianbo Ji | Quanwei Sun | Liting Zhou
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

In this paper, we describe our DCU-Lorcan system for the FinCausal 2022 shared task: span-based cause and effect extraction from financial documents. We frame the FinCausal 2022 causality extraction task as a span extraction/sequence labeling task. Our submitted systems are based on the contextualized word representations produced by pre-trained language models, with linear layers predicting the label for each word, followed by post-processing heuristics. In experiments, we employ pre-trained language models including DistilBERT, BERT and SpanBERT. Our best-performing system achieves F-1, Recall, Precision and Exact Match scores of 92.76, 92.77, 92.76 and 68.60 respectively. Additionally, we conduct experiments investigating the effect of data size on the performance of the causality extraction model, and an error analysis of the model's predictions.
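The step from per-word labels to cause and effect spans can be sketched as follows. This assumes a BIO tagging scheme with hypothetical labels `C` (cause) and `E` (effect); the abstract does not specify the tag set or the post-processing heuristics, so the decoder below is only one plausible instance.

```python
def decode_spans(words, tags):
    """Turn per-word BIO tags into a list of (label, text) spans.

    A span starts at a B- tag and extends over following I- tags with the
    same label. Stray I- tags with no open span are ignored (a simple
    heuristic chosen for this sketch).
    """
    spans = []
    label, start = None, None
    # The extra "O" acts as a sentinel that closes a span ending at the last word.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            spans.append((label, " ".join(words[start:i])))
            label = None
        if tag.startswith("B-"):
            label, start = tag[2:], i
    return spans


# Hypothetical sentence and predicted tags.
words = ["Profits", "fell", "because", "costs", "rose"]
tags = ["B-E", "I-E", "O", "B-C", "I-C"]
spans = decode_spans(words, tags)
```

In a full system the tags would come from the linear layer on top of the pre-trained encoder, with one prediction per word.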

MLLabs-LIG at TempoWiC 2022: A Generative Approach for Examining Temporal Meaning Shift
Chenyang Lyu | Yongxin Zhou | Tianbo Ji
Proceedings of the First Workshop on Ever Evolving NLP (EvoNLP)

In this paper, we present our system for the EvoNLP 2022 shared task Temporal Meaning Shift (TempoWiC). In contrast to the commonly used discriminative models, we propose a generative approach based on pre-trained generation models. The basic architecture of our system is a seq2seq model: the input sequence consists of two documents followed by a question asking whether the meaning of the target word has changed, and the target output sequence is a declarative sentence stating whether the meaning of the target word has changed. Experimental results on the TempoWiC test set show that our best system (with time information) obtained an accuracy of 68.09% and a Macro F-1 score of 62.59%, ranking 12th among all submitted systems. The results show the plausibility of using generation models for WiC tasks, while also indicating that there is still room for further improvement.
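The input and output format described in this abstract might look like the sketch below. The exact prompt wording, the `text 1:` / `text 2:` / `question:` prefixes, and both function names are hypothetical; only the overall shape (two documents plus a question as input, a declarative sentence as output) comes from the abstract.

```python
def build_example(doc1, doc2, target_word, changed):
    """Format one TempoWiC-style instance as a (source, target) string pair."""
    question = f'Did the meaning of "{target_word}" change between the two texts?'
    source = f"text 1: {doc1} text 2: {doc2} question: {question}"
    if changed:
        target = f'The meaning of "{target_word}" changed.'
    else:
        target = f'The meaning of "{target_word}" did not change.'
    return source, target


def parse_prediction(generated):
    """Map a generated sentence back to a boolean 'changed' label."""
    return "did not change" not in generated.lower()


# Hypothetical documents for illustration.
source, target = build_example("doc one", "doc two", "frisk", changed=False)
```

Mapping the generated sentence back to a binary label, as `parse_prediction` does, is what allows accuracy and Macro F-1 to be computed for a generative model.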


Improving Unsupervised Question Answering via Summarization-Informed Question Generation
Chenyang Lyu | Lifeng Shang | Yvette Graham | Jennifer Foster | Xin Jiang | Qun Liu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Question Generation (QG) is the task of generating a plausible question for a given <passage, answer> pair. Template-based QG uses linguistically-informed heuristics to transform declarative sentences into interrogatives, whereas supervised QG uses existing Question Answering (QA) datasets to train a system to generate a question given a passage and an answer. A disadvantage of the heuristic approach is that the generated questions are heavily tied to their declarative counterparts. A disadvantage of the supervised approach is that the generated questions are heavily tied to the domain/language of the QA dataset used as training data. In order to overcome these shortcomings, we propose a distantly-supervised QG method which uses questions generated heuristically from summaries as a source of training data for a QG system. We make use of freely available news summary data, transforming declarative summary sentences into appropriate questions using heuristics informed by dependency parsing, named entity recognition and semantic role labeling. The resulting questions are then combined with the original news articles to train an end-to-end neural QG model. We extrinsically evaluate our approach using unsupervised QA: our QG model is used to generate synthetic QA pairs for training a QA model. Experimental results show that, trained with only 20k English Wikipedia-based synthetic QA pairs, the QA model substantially outperforms previous unsupervised models on three in-domain datasets (SQuAD1.1, Natural Questions, TriviaQA) and three out-of-domain datasets (NewsQA, BioASQ, DuoRC), demonstrating the transferability of the approach.


Improving Document-Level Sentiment Analysis with User and Product Context
Chenyang Lyu | Jennifer Foster | Yvette Graham
Proceedings of the 28th International Conference on Computational Linguistics

Past work that improves document-level sentiment analysis by encoding user and product information has been limited to considering only the text of the current review. We investigate incorporating additional review text available at the time of sentiment prediction that may prove meaningful for guiding prediction. Firstly, we incorporate all available historical review text belonging to the author of the review in question. Secondly, we investigate the inclusion of historical reviews associated with the current product (written by other users). We achieve this by explicitly storing representations of reviews written by the same user and about the same product, forcing the model to memorize all reviews for one particular user and product. Additionally, we drop the hierarchical architecture used in previous work to enable words in the text to directly attend to each other. Experimental results on the IMDB, Yelp 2013 and Yelp 2014 datasets show improvements over the state of the art of more than 2 percentage points in the best case.