Are Your LLMs Capable of Stable Reasoning?

Junnan Liu; Hongwei Liu; Linchen Xiao; Ziyi Wang; Kuikun Liu; Songyang Gao; Wenwei Zhang; Songyang Zhang; Kai Chen

doi:10.18653/v1/2025.findings-acl.905

Abstract

The rapid advancement of large language models (LLMs) has shown remarkable progress in complex reasoning tasks. However, a significant disparity exists between benchmark performance and real-world applications. We attribute this gap primarily to current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, especially in complex reasoning tasks where both accuracy and consistency are essential. In this paper, we introduce **G-Pass@k**, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts, quantifying both the model’s performance potential and its stability. Through extensive experiments on various public and newly constructed benchmarks, we employ G-Pass@k in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency. Our findings reveal a significant opportunity to enhance the realistic reasoning abilities of LLMs, underscoring the necessity for more robust evaluation metrics.

Anthology ID: 2025.findings-acl.905
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 17594–17632
URL: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.905/
DOI: 10.18653/v1/2025.findings-acl.905
Cite (ACL): Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, and Kai Chen. 2025. Are Your LLMs Capable of Stable Reasoning? In Findings of the Association for Computational Linguistics: ACL 2025, pages 17594–17632, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Are Your LLMs Capable of Stable Reasoning? (Liu et al., Findings 2025)
PDF: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.905.pdf
