2025
pdf
bib
abs
Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims
Priyanka Kargupta
|
Runchu Tian
|
Jiawei Han
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely “true” or “false”—as is frequently the case with scientific and political claims. However, a claim (e.g., “vaccine A is better than vaccine B”) can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., “how many biomedical papers believe vaccine A is more transportable than B?”). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.
pdf
bib
abs
Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs
Runchu Tian
|
Yanghao Li
|
Yuepeng Fu
|
Siyang Deng
|
Qinyu Luo
|
Cheng Qian
|
Shuo Wang
|
Xin Cong
|
Zhong Zhang
|
Yesai Wu
|
Yankai Lin
|
Huadong Wang
|
Xiaojiang Liu
Findings of the Association for Computational Linguistics: ACL 2025
Positional bias in large language models hinders their ability to effectively process long inputs. A prominent example is the “lost in the middle” phenomenon, where LLMs struggle to utilize relevant information situated in the middle of the input. While prior research primarily focuses on single pieces of relevant information, real-world applications often involve multiple relevant information pieces. To bridge this gap, we present LongPiBench, a benchmark designed to assess positional bias involving multiple pieces of relevant information. It includes various tasks and input lengths. Thorough experiments are conducted with three commercial and six open-source models. These experiments reveal that while most current models are more robust against the “lost in the middle” issue, there also exist noticeable biases related to the spacing of relevant information pieces. These findings highlight the importance of evaluating and reducing positional biases for long-context LLMs.
2024
pdf
bib
abs
DebugBench: Evaluating Debugging Capability of Large Language Models
Runchu Tian
|
Yining Ye
|
Yujia Qin
|
Xin Cong
|
Yankai Lin
|
Yinxu Pan
|
Yesai Wu
|
Hui Haotian
|
Liu Weichuan
|
Zhiyuan Liu
|
Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2024
Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs’ debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce ‘DebugBench’, an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and four open-source models in a zero-shot scenario. We find that (1) while closed-source models exhibit inferior debugging performance compared to humans, open-source models relatively lower pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance which is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.