Chengzhao Wu


2025

Analysing Reference Production of Large Language Models
Chengzhao Wu | Guanyi Chen | Fahime Same | Tingting He
Proceedings of the 18th International Natural Language Generation Conference

This study investigates how large language models (LLMs) produce referring expressions (REs) and to what extent their behaviour aligns with human patterns. We evaluate LLM performance in two settings: slot filling, the conventional task of referring expression generation, where REs are generated within a fixed context, and language generation, where REs are analysed within fully generated texts. Using the WebNLG corpus, we assess how well LLMs capture human variation in reference production and analyse their behaviour by examining the influence of several factors known to affect human reference production, including referential form, syntactic position, recency, and discourse status. Our findings show that (1) task framing significantly affects LLMs’ reference production; (2) while LLMs are sensitive to some of these factors, their referential behaviour consistently diverges from human use; and (3) larger model size does not necessarily yield more human-like variation. These results underscore key limitations in current LLMs’ ability to replicate human referential choices.
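For concreteness, a minimal Python sketch of the two evaluation settings described above. The prompts and the WebNLG-style entity shown are illustrative assumptions, not the prompts or data items used in the paper.

# Illustrative sketch (assumed prompts, not from the paper) of the two settings:
# slot filling with a fixed context vs. free language generation from triples.
triples = [("Alan_Bean", "birthPlace", "Wheeler,_Texas")]

# Slot filling: the surrounding text is fixed and the model only produces the
# referring expression for the marked slot.
slot_filling_prompt = (
    "Fill in the blank with a suitable referring expression.\n"
    "Text: Alan Bean was born in Wheeler, Texas. ___ later became an astronaut."
)

# Language generation: the model verbalises the whole text from the input
# triples, and the referring expressions it produces are analysed post hoc.
generation_prompt = (
    f"Write a short text expressing the following triples: {triples}"
)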

CCNU at SemEval-2025 Task 8: Enhancing Question Answering on Tabular Data with Two-Stage Corrections
Chenlian Zhou | Xilu Cai | Yajuan Tong | Chengzhao Wu | Xin Xu | Guanyi Chen | Tingting He
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

We present the system developed by the Central China Normal University (CCNU) team for SemEval-2025 shared task 8, which focuses on Question Answering (QA) over tabular data. Our approach leverages multiple Large Language Models (LLMs), framing tabular QA as code completion. To improve reliability, we introduce a two-stage correction mechanism in which the LLM is instructed to correct the generated code based on judgements of whether the code is executable and whether the answer obtained by executing it is semantically consistent with the question. Experiments demonstrate that code correction works but answer correction does not. Finally, we discuss other unsuccessful approaches explored during development.
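As a rough illustration of the pipeline described above (not the authors' implementation), the Python sketch below assumes a generic ask_llm chat-completion callable and hypothetical prompts. It frames tabular QA as code completion and applies two correction stages: one triggered by execution errors, the other by a semantic-consistency judgement on the obtained answer.

# Minimal sketch of a two-stage correction loop for tabular QA as code
# completion. `ask_llm` and all prompts are illustrative stand-ins.
import pandas as pd
from typing import Callable, Optional

def answer_with_corrections(question: str, table: pd.DataFrame,
                            ask_llm: Callable[[str], str]) -> Optional[object]:
    code = ask_llm(
        f"Write Python that stores the answer to '{question}' about the "
        f"dataframe `df` in a variable `answer`.\n"
        f"Columns: {list(table.columns)}"
    )

    # Stage 1: execution-based correction. If the generated code raises,
    # feed the error message back and ask the LLM to repair the code.
    for _ in range(2):
        scope = {"df": table}
        try:
            exec(code, scope)  # caution: executes model-generated code (sketch only)
            break
        except Exception as err:
            code = ask_llm(f"The code failed with: {err}\nFix it:\n{code}")
    else:
        return None  # both attempts failed to execute

    answer = scope.get("answer")

    # Stage 2: answer-based correction. Ask the LLM whether the obtained
    # answer is semantically consistent with the question; if not, revise.
    verdict = ask_llm(
        f"Question: {question}\nAnswer: {answer}\n"
        "Is this answer consistent with the question? Reply yes or no."
    )
    if verdict.strip().lower().startswith("no"):
        code = ask_llm(f"Revise the code so it answers the question:\n{code}")
        scope = {"df": table}
        try:
            exec(code, scope)
            answer = scope.get("answer")
        except Exception:
            pass  # keep the stage-1 answer if the revision does not run
    return answer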