Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic

Zhenjiang Mao, Artem Bisliouk, Rohith Nama, Ivan Ruchkin


Abstract
Large Language Models (LLMs) have shown impressive performance in mathematical reasoning tasks when guided by Chain-of-Thought (CoT) prompting. However, they tend to produce highly confident yet incorrect outputs, which poses significant risks in domains like education, where users may lack the expertise to assess reasoning steps. To address this, we propose a structured framework that models stepwise confidence as a temporal signal and evaluates it using Signal Temporal Logic (STL). In particular, we define formal STL-based constraints to capture desirable temporal properties and compute robustness scores that serve as structured, interpretable confidence estimates. Our approach also introduces a set of uncertainty reshaping strategies to enforce smoothness, monotonicity, and causal consistency across the reasoning trajectory. Experiments show that our approach consistently improves calibration metrics and provides more reliable uncertainty estimates than conventional confidence aggregation and post-hoc calibration.
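As a rough illustration of the idea of treating stepwise confidence as a temporal signal, the sketch below computes standard quantitative STL robustness for three simple properties over a confidence trace: "always at least tau", "eventually at least tau", and "always smooth" (consecutive steps differ by at most delta). The function names, thresholds, and sample trace are hypothetical and chosen for illustration; they are not the paper's actual constraints or reshaping strategies.

```python
import math
from typing import Sequence


def always_at_least(conf: Sequence[float], tau: float) -> float:
    """Robustness of G(c_t >= tau): the worst-case margin over all steps."""
    return min(c - tau for c in conf)


def eventually_at_least(conf: Sequence[float], tau: float) -> float:
    """Robustness of F(c_t >= tau): the best-case margin over all steps."""
    return max(c - tau for c in conf)


def always_smooth(conf: Sequence[float], delta: float) -> float:
    """Robustness of G(|c_{t+1} - c_t| <= delta).

    A single-step trace has no consecutive pairs, so the property holds
    vacuously and the robustness is +infinity.
    """
    return min(
        (delta - abs(b - a) for a, b in zip(conf, conf[1:])),
        default=math.inf,
    )


# Hypothetical per-step confidences for a four-step reasoning chain.
trace = [0.9, 0.8, 0.4, 0.85]

# Assumed thresholds, for illustration only.
scores = {
    "always >= 0.5": always_at_least(trace, 0.5),
    "eventually >= 0.5": eventually_at_least(trace, 0.5),
    "smooth within 0.3": always_smooth(trace, 0.3),
}

for name, rho in scores.items():
    status = "satisfied" if rho >= 0 else "violated"
    print(f"{name}: robustness={rho:+.2f} ({status})")
```

A positive robustness value means the property holds with that margin, and a negative value means it is violated by that amount. In this trace the dip to 0.4 at the third step violates both the "always" and the smoothness property, which a plain average of the confidences would not reveal.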
Anthology ID:
2025.bea-1.65
Volume:
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Ekaterina Kochmar, Bashar Alhafni, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anaïs Tack, Victoria Yaneva, Zheng Yuan
Venues:
BEA | WS
Publisher:
Association for Computational Linguistics
Pages:
882–890
URL:
https://preview.aclanthology.org/landing_page/2025.bea-1.65/
Cite (ACL):
Zhenjiang Mao, Artem Bisliouk, Rohith Nama, and Ivan Ruchkin. 2025. Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), pages 882–890, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic (Mao et al., BEA 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.bea-1.65.pdf