Tim G. J. Rudner


2025

MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs
Gabrielle Kaili-May Liu | Gal Yona | Avi Caciularu | Idan Szpektor | Tim G. J. Rudner | Arman Cohan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

A critical component in the trustworthiness of LLMs is reliable uncertainty communication, yet LLMs often use assertive language when conveying false claims, leading to over-reliance and eroded trust. We present the first systematic study of _faithful confidence calibration_ of LLMs, benchmarking models’ ability to use linguistic expressions of uncertainty that _faithfully reflect_ their intrinsic uncertainty, across a comprehensive array of models, datasets, and prompting strategies. Our results demonstrate that LLMs largely fail at this task, and that existing interventions are insufficient: standard prompting approaches provide only marginal gains, and existing factuality-based calibration techniques can even harm faithful calibration. To address this critical gap, we introduce MetaFaith, a novel prompt-based calibration approach inspired by human metacognition. We show that MetaFaith robustly improves faithful calibration across diverse models and task domains, enabling up to 61% improvement in faithfulness and achieving an 83% win rate over original generations as judged by humans.
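
To make the notion of faithful calibration concrete, here is a minimal, hypothetical sketch of how one might compare a model's verbalized confidence against an intrinsic confidence estimate. The hedge-word lexicon, the sampling-agreement estimate, and all function names are illustrative assumptions, not the metric used in the paper.

```python
# Illustrative sketch only: compare how confident a response *sounds* (verbalized)
# with how confident the model *is* (intrinsic, estimated via sampling agreement).
from collections import Counter
from typing import Callable

# Hypothetical lexicon mapping hedging language to a rough verbalized-confidence score.
HEDGE_SCORES = {
    "definitely": 1.0, "certainly": 1.0, "likely": 0.7,
    "possibly": 0.4, "might": 0.4, "unsure": 0.2, "i don't know": 0.0,
}

def verbalized_confidence(response: str) -> float:
    """Score how confident the response sounds, based on the hedge words it contains."""
    text = response.lower()
    scores = [s for phrase, s in HEDGE_SCORES.items() if phrase in text]
    return sum(scores) / len(scores) if scores else 0.9  # treat unhedged text as assertive

def intrinsic_confidence(generate: Callable[[str], str], question: str, n: int = 10) -> float:
    """Estimate intrinsic confidence as agreement across repeated samples."""
    answers = [generate(question) for _ in range(n)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n

def faithfulness_gap(generate: Callable[[str], str], question: str, response: str) -> float:
    """Absolute gap between how confident the model sounds and how confident it is."""
    return abs(verbalized_confidence(response) - intrinsic_confidence(generate, question))
```

A large gap on questions the model answers inconsistently would indicate unfaithful (over-assertive) uncertainty expression, the failure mode the paper studies.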

SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models
Carter Teplica | Yixin Liu | Arman Cohan | Tim G. J. Rudner
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We investigate the mechanistic sources of uncertainty in large language models (LLMs), an area with important implications for language model reliability and trustworthiness. To do so, we conduct a series of experiments designed to identify whether the factuality of generated responses and a model’s uncertainty originate in separate or shared circuits in the model architecture. We approach this question by adapting the well-established mechanistic interpretability techniques of causal tracing and zero-ablation to study the effect of different circuits on LLM generations. Our experiments on eight different models and five datasets, representing tasks predominantly requiring factual recall, provide strong evidence that a model’s uncertainty is produced in the same parts of the network that are responsible for the factuality of generated responses.
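
For readers unfamiliar with zero-ablation, the sketch below illustrates the general idea on a small open model: knock out a single attention head and observe how both a factuality signal and an uncertainty proxy shift. The model, prompt, head choice, and metrics are assumptions for illustration only; this is not the paper's experimental setup.

```python
# Illustrative sketch only: zero-ablate one attention head via Hugging Face's
# head_mask argument and measure the effect on (i) the log-probability of a
# factual answer token and (ii) next-token entropy as an uncertainty proxy.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The capital of France is"
answer = " Paris"
inputs = tokenizer(prompt, return_tensors="pt")
answer_id = tokenizer(answer)["input_ids"][0]

def probe(head_mask=None):
    """Return (log-prob of the factual answer token, next-token entropy)."""
    with torch.no_grad():
        logits = model(**inputs, head_mask=head_mask).logits[0, -1]
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum()
    return log_probs[answer_id].item(), entropy.item()

# Baseline: all heads active.
base_logp, base_entropy = probe()

# Zero-ablate head 7 in layer 5 (arbitrary choice for illustration).
mask = torch.ones(model.config.n_layer, model.config.n_head)
mask[5, 7] = 0.0
abl_logp, abl_entropy = probe(head_mask=mask)

# If factuality and uncertainty share a circuit, ablations that lower the answer
# log-prob should also shift the entropy, and vice versa.
print(f"Δ answer log-prob:    {abl_logp - base_logp:+.4f}")
print(f"Δ next-token entropy: {abl_entropy - base_entropy:+.4f}")
```

Sweeping this over all layers and heads, and correlating the two deltas across many factual-recall prompts, gives a rough picture of whether the circuits are shared or separate.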