Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
Yubo Li, Yidi Miao, Xueying Ding, Ramayya Krishnan, Rema Padman
Abstract
Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stakes domains requires consistent and coherent behavior across multiple rounds of user interaction. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we introduce Position-Weighted Consistency (PWC), a metric designed to capture both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present MT-Consistency, a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that improves response stability by explicitly integrating internal model confidence scores during the generation process. Experimental results demonstrate that CARG significantly improves response stability without sacrificing accuracy, offering a practical path toward more dependable LLM behavior in critical, real-world deployments.
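The PWC formula itself is defined in the paper rather than on this page; the short Python sketch below is only a minimal illustration of the general idea of a position-weighted consistency score, assuming binary per-turn agreement flags and a hypothetical exponentially decaying weight per turn so that early-turn flips cost more than late ones.

```python
# Minimal, hypothetical sketch of a position-weighted consistency score.
# The actual PWC metric is defined in the paper; the function name, the
# decay weighting, and the binary agreement encoding are assumptions here.
from typing import Sequence


def position_weighted_consistency(per_turn_consistent: Sequence[bool],
                                  decay: float = 0.8) -> float:
    """Score in [0, 1]: 1.0 means the answer never flipped across turns.

    Earlier turns get larger weights (decay ** turn_index), so a flip in
    turn 2 hurts the score more than a flip in turn 8.
    """
    if not per_turn_consistent:
        raise ValueError("need at least one follow-up turn")
    weights = [decay ** t for t in range(len(per_turn_consistent))]
    kept = sum(w for w, ok in zip(weights, per_turn_consistent) if ok)
    return kept / sum(weights)


if __name__ == "__main__":
    # Model holds its answer, flips once under a challenging follow-up,
    # then recovers and stays stable for the remaining turns.
    print(round(position_weighted_consistency([True, False, True, True]), 3))
```

Under this toy weighting, a model that flips early and never recovers scores lower than one that wavers only in later turns, which mirrors the abstract's emphasis on early-stage stability and recovery patterns.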
- Anthology ID:
- 2025.findings-acl.347
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2025
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 6679–6700
- URL:
- https://preview.aclanthology.org/display_plenaries/2025.findings-acl.347/
- Cite (ACL):
- Yubo Li, Yidi Miao, Xueying Ding, Ramayya Krishnan, and Rema Padman. 2025. Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions. In Findings of the Association for Computational Linguistics: ACL 2025, pages 6679–6700, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions (Li et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/display_plenaries/2025.findings-acl.347.pdf