2025
LoGU: Long-form Generation with Uncertainty Expressions
Ruihan Yang | Caiqi Zhang | Zhisong Zhang | Xinting Huang | Sen Yang | Nigel Collier | Dong Yu | Deqing Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While Large Language Models (LLMs) demonstrate impressive capabilities, they still generate factually incorrect content (i.e., hallucinations). A promising approach to mitigating this issue is enabling models to express uncertainty when unsure. Previous research on uncertainty modeling has primarily focused on short-form QA, but real-world applications often require much longer responses. In this work, we introduce the task of Long-form Generation with Uncertainty (LoGU). We identify two key challenges: Uncertainty Suppression, where models hesitate to express uncertainty, and Uncertainty Misalignment, where models convey uncertainty inaccurately. To tackle these challenges, we propose a refinement-based data collection framework and a two-stage training pipeline. Our framework adopts a divide-and-conquer strategy, refining uncertainty based on atomic claims. The collected data are then used for training through supervised fine-tuning (SFT) and direct preference optimization (DPO) to enhance uncertainty expression. Extensive experiments on three long-form instruction-following datasets show that our method significantly improves accuracy, reduces hallucinations, and maintains the comprehensiveness of responses.
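As a rough illustration of the claim-level refinement idea in the abstract, the Python sketch below decomposes a response into atomic claims, keeps verified claims, and hedges unverified ones. The helpers (decompose_into_claims, is_supported, hedge) are hypothetical stand-ins for an atomic-claim decomposer, a fact verifier, and an uncertainty rewriter, not the paper's actual implementation.

```python
# Minimal sketch of claim-level refinement for uncertainty data collection (assumed interfaces).
from typing import Callable, List, Tuple


def refine_response(
    response: str,
    decompose_into_claims: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
    hedge: Callable[[str], str],
) -> Tuple[str, str]:
    """Divide-and-conquer refinement: keep supported claims, hedge unsupported ones.

    Returns (original_response, refined_response); the pair could serve as a DPO
    preference pair (refined preferred over original), or the refined text alone
    as an SFT target.
    """
    claims = decompose_into_claims(response)      # divide: split into atomic claims
    refined_claims = []
    for claim in claims:                          # conquer: refine each claim independently
        if is_supported(claim):
            refined_claims.append(claim)          # keep factual claims unchanged
        else:
            refined_claims.append(hedge(claim))   # e.g. "I am not sure, but ..."
    return response, " ".join(refined_claims)
```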
UNCLE: Benchmarking Uncertainty Expressions in Long-Form Generation
Ruihan Yang | Caiqi Zhang | Zhisong Zhang | Xinting Huang | Dong Yu | Nigel Collier | Deqing Yang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) are prone to hallucination, particularly in long-form generations. A promising direction to mitigate hallucination is to teach LLMs to express uncertainty explicitly when they lack sufficient knowledge. However, existing work lacks direct and fair evaluation of LLMs’ ability to express uncertainty effectively in long-form generation. To address this gap, we first introduce UNCLE, a benchmark designed to evaluate uncertainty expression in both long- and short-form question answering (QA). UNCLE covers five domains and includes more than 1,000 entities, each with paired short- and long-form QA items. Our dataset is the first to directly link short- and long-form QA through aligned questions and gold-standard answers.Along with UNCLE, we propose a suite of new metrics to assess the models’ capabilities to selectively express uncertainty. We then demonstrate that current models fail to convey uncertainty appropriately in long-form generation. We further explore both prompt-based and training-based methods to improve models’ performance, with the training-based methods yielding greater gains. Further analysis of alignment gaps between short- and long-form uncertainty expression highlights promising directions for future research using UNCLE.
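To convey what "selectively expressing uncertainty" means as a measurable quantity, here is a hypothetical precision/recall-style scoring sketch: it rewards a model for hedging exactly on the items it does not know (as judged by the paired short-form answers). The field names and the scoring itself are illustrative assumptions; UNCLE's actual metrics are defined in the paper and may differ.

```python
# Illustrative selective-uncertainty scores (assumed data schema, not UNCLE's official metrics).
def selective_uncertainty_scores(records):
    """records: iterable of dicts with boolean fields
       'expressed_uncertainty' (model hedged on this item) and
       'model_knows' (model answers the paired short-form question correctly)."""
    hedged = [r for r in records if r["expressed_uncertainty"]]
    unknown = [r for r in records if not r["model_knows"]]
    tp = sum(1 for r in hedged if not r["model_knows"])  # hedged exactly where knowledge is missing
    precision = tp / len(hedged) if hedged else 0.0
    recall = tp / len(unknown) if unknown else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```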
Atomic Calibration of LLMs in Long-Form Generations
Caiqi Zhang | Ruihan Yang | Zhisong Zhang | Xinting Huang | Sen Yang | Dong Yu | Nigel Collier
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, as an effective indicator of hallucination, is thus essential for enhancing the trustworthiness of LLMs. Prior work mainly focuses on short-form tasks using a single response-level score (macro calibration), which is insufficient for long-form outputs that may contain both accurate and inaccurate claims. In this work, we systematically study atomic calibration, which evaluates factuality calibration at a fine-grained level by decomposing long responses into atomic claims. We further categorize existing confidence elicitation methods into discriminative and generative types, and propose two new confidence fusion strategies to improve calibration. Our experiments demonstrate that LLMs exhibit poorer calibration at the atomic level during long-form generation. More importantly, atomic calibration uncovers insightful patterns regarding the alignment of confidence methods and how confidence changes over the course of generation. This sheds light on future research directions for confidence estimation in long-form generation.
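The macro-versus-atomic distinction can be made concrete with expected calibration error (ECE). The sketch below assumes each response is given as a list of (claim_confidence, claim_is_correct) pairs; macro calibration collapses each response to averages first, while atomic calibration pools all claims. This is a generic ECE illustration under assumed data shapes, not the paper's exact evaluation code.

```python
# Sketch: macro (response-level) vs. atomic (claim-level) calibration via ECE.
import numpy as np


def ece(confidences, correctness, n_bins=10):
    """Standard expected calibration error over (confidence, correctness) pairs."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correctness, dtype=float)
    bin_ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)  # equal-width bins on [0, 1]
    err = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            err += mask.mean() * abs(conf[mask].mean() - corr[mask].mean())
    return err


def atomic_ece(responses, n_bins=10):
    """Pool every atomic claim across responses and compute one claim-level ECE."""
    confs = [c for resp in responses for c, _ in resp]
    corrs = [ok for resp in responses for _, ok in resp]
    return ece(confs, corrs, n_bins)


def macro_ece(responses, n_bins=10):
    """Collapse each response to an average confidence and factuality score first."""
    confs = [np.mean([c for c, _ in resp]) for resp in responses]
    corrs = [np.mean([ok for _, ok in resp]) for resp in responses]
    return ece(confs, corrs, n_bins)
```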
SELFGOAL: Your Language Agents Already Know How to Achieve High-level Goals
Ruihan Yang | Jiangjie Chen | Yikai Zhang | Siyu Yuan | Aili Chen | Kyle Richardson | Yanghua Xiao | Deqing Yang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Language agents powered by large language models (LLMs) are increasingly valuable as decision-making tools in domains such as gaming and programming. However, these agents often struggle to achieve high-level goals without detailed instructions and to adapt to environments where feedback is delayed. In this paper, we present SELFGOAL, a novel automatic approach designed to enhance agents' ability to achieve high-level goals with limited human priors and environmental feedback. The core idea of SELFGOAL is to adaptively break down a high-level goal into a tree of more practical subgoals during interaction with the environment, while identifying the most useful subgoals and progressively updating this structure. Experimental results demonstrate that SELFGOAL significantly enhances the performance of language agents across various tasks, including competitive, cooperative, and deferred-feedback environments.
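To illustrate the subgoal-tree idea, the sketch below keeps a tree of goal nodes, expands the currently most useful leaves during interaction, and returns the selected subgoals as guidance. The hooks decompose_fn and rank_fn are hypothetical placeholders for LLM calls; the structure is a reading of the abstract, not SELFGOAL's actual prompts or interfaces.

```python
# Sketch of a subgoal tree that is grown and pruned during interaction (assumed hooks).
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GoalNode:
    description: str
    children: List["GoalNode"] = field(default_factory=list)

    def leaves(self) -> List["GoalNode"]:
        return [self] if not self.children else [l for c in self.children for l in c.leaves()]


def step(root: GoalNode,
         observation: str,
         decompose_fn: Callable[[str, str], List[str]],   # (subgoal, observation) -> finer subgoals
         rank_fn: Callable[[List[str], str], List[str]],   # (subgoals, observation) -> ranked subgoals
         top_k: int = 3) -> List[str]:
    """One update: select the most useful leaf subgoals, expand them, and return them as guidance."""
    leaf_texts = [leaf.description for leaf in root.leaves()]
    useful = rank_fn(leaf_texts, observation)[:top_k]
    for leaf in root.leaves():
        if leaf.description in useful:
            leaf.children = [GoalNode(d) for d in decompose_fn(leaf.description, observation)]
    return useful  # e.g. prepended to the agent's prompt at the next action step
```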
2024
GumbelSoft: Diversified Language Model Watermarking via the GumbelMax-trick
Jiayi Fu | Xuandong Zhao | Ruihan Yang | Yuansen Zhang | Jiangjie Chen | Yanghua Xiao
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) excel at generating human-like text, but they also raise concerns about misuse in fake news and academic dishonesty. Decoding-based watermarking, particularly watermarking based on the GumbelMax trick (GM watermark), is a standout solution for safeguarding machine-generated text due to its notable detectability. However, the GM watermark faces a major challenge with generation diversity: it always yields identical outputs for the same prompt, harming diversity and user experience. To overcome this limitation, we introduce a new type of GM watermark, the Logits-Addition watermark, as well as three variants that aim to enhance diversity, particularly the GumbelSoft watermark (i.e., the softmax variant of the Logits-Addition watermark). When assessed for detectability in high-diversity settings, GumbelSoft demonstrates superior performance, with its AUROC score exceeding those of the two alternative variants by a margin of 0.1 to 0.3 and outperforming other decoding-based watermarking methods by at least 0.1.
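A simplified, hedged sketch of the underlying mechanism: pseudo-random Gumbel noise keyed by a secret key and the recent context is added to the token logits; taking the argmax gives deterministic GM-style decoding, while sampling from a softmax over the perturbed logits (the "soft" idea) restores diversity. This only conveys the general Gumbel-based watermarking mechanism, not the paper's exact Logits-Addition or GumbelSoft algorithms, and the 4-token hashing window is an arbitrary choice for illustration.

```python
# Sketch of keyed Gumbel-noise watermarked decoding (general mechanism, not the paper's exact method).
import hashlib
from typing import Optional, Sequence

import numpy as np


def keyed_gumbel(key: str, context_ids: Sequence[int], vocab_size: int) -> np.ndarray:
    """Derive per-token Gumbel(0, 1) noise from a hash of the secret key and recent context."""
    seed = int.from_bytes(
        hashlib.sha256((key + ",".join(map(str, context_ids))).encode()).digest()[:8], "big"
    )
    rng = np.random.default_rng(seed)
    u = rng.uniform(1e-9, 1.0, size=vocab_size)
    return -np.log(-np.log(u))


def next_token(logits: np.ndarray, key: str, context_ids: Sequence[int],
               tau: Optional[float] = None) -> int:
    """tau=None: argmax over perturbed logits (GM-style, identical output per prompt).
    tau>0: sample from a softmax over perturbed logits, trading determinism for diversity."""
    g = keyed_gumbel(key, context_ids[-4:], logits.shape[0])  # hashing window size is an assumption
    scores = logits + g
    if tau is None:
        return int(np.argmax(scores))
    p = np.exp((scores - scores.max()) / tau)
    p /= p.sum()
    return int(np.random.default_rng().choice(len(p), p=p))
```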