Xinyi Liu

2024

An Empirical Analysis on Large Language Models in Debate Evaluation
Xinyi Liu | Pinxin Liu | Hangfeng He
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In this study, we investigate the capabilities and inherent biases of advanced large language models (LLMs) such as GPT-3.5 and GPT-4 in the context of debate evaluation. We discover that LLMs’ performance exceeds that of humans and surpasses the performance of state-of-the-art methods fine-tuned on extensive datasets. We additionally explore and analyze biases present in LLMs, including positional bias, lexical bias, and order bias, which may affect their evaluative judgments. Our findings reveal a consistent bias in both GPT-3.5 and GPT-4 towards the second candidate response presented, attributed to prompt design. We also uncover a lexical bias in both GPT-3.5 and GPT-4, especially when label sets carry numerical or sequential connotations, highlighting the critical need for careful label verbalizer selection in prompt design. Additionally, our analysis indicates a tendency of both models to favor the debate’s concluding side as the winner, suggesting an end-of-discussion bias.

Euphemisms are found across the world’s languages, making them a universal linguistic phenomenon. As such, euphemistic data may have useful properties for computational tasks across languages. In this study, we explore this premise by training a multilingual transformer model (XLM-RoBERTa) to disambiguate potentially euphemistic terms (PETs) in multilingual and cross-lingual settings. In line with current trends, we demonstrate that zero-shot learning across languages takes place. We also show cases where multilingual models perform better on the task compared to monolingual models by a statistically significant margin, indicating that multilingual data presents additional opportunities for models to learn about cross-lingual, computational properties of euphemisms. In a follow-up analysis, we focus on universal euphemistic “categories” such as death and bodily functions among others. We test to see whether cross-lingual data of the same domain is more important than within-language data of other domains to further understand the nature of the cross-lingual transfer.

JN666 at SemEval-2024 Task 7: NumEval: Numeral-Aware Language Understanding and Generation
Xinyi Liu | Xintong Liu | Hengyang Lu
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper is submitted for SemEval-2024 Task 7: Enhancing the Model’s Understanding and Generation of Numerical Values. The dataset for this task is NQuAD, which requires us to select the most suitable of four numerical options to fill in the blank in a news article based on the context. Based on the BertForMultipleChoice model, we proposed two new models, MC BERT and SSC BERT, and improved the model’s numerical understanding ability by pre-training the model on numerical comparison tasks. Ultimately, our best-performing model achieved an accuracy rate of 79.40%, which is 9.45% higher than the accuracy rate of NEMo.
