Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression

Farima Fatahi Bayat; Xin Liu; H. Jagadish; Lu Wang

doi:10.18653/v1/2024.findings-acl.737

Abstract

Large language models (LLMs) can generate long-form and coherent text, yet they often hallucinate facts, which undermines their reliability. To mitigate this issue, inference-time methods steer LLM representations toward the “truthful directions” previously learned for truth elicitation. However, applying these truthful directions with the same intensity fails to generalize across different query contexts. We propose LITO, a Learnable Intervention method for Truthfulness Optimization that automatically identifies the optimal intervention intensity tailored to each specific context. LITO explores a sequence of model generations based on increasing levels of intervention intensities. It selects the most accurate response or refuses to answer when the predictions are highly uncertain. Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness while preserving task accuracy. The adaptive nature of LITO counters the limitations of one-size-fits-all intervention methods, maximizing truthfulness by reflecting the model’s internal knowledge only when it is confident. Our code is available at https://github.com/launchnlp/LITO.
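The selection step described in the abstract (generate at several intervention intensities, then keep the most reliable response or refuse) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` type, the `confidence` scores, the `threshold` value, and the refusal string are all hypothetical placeholders, and the sketch assumes each candidate response already carries a confidence estimate from some external scorer.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical refusal message; the actual wording is an assumption.
REFUSAL = "I have no comment."


@dataclass
class Candidate:
    """One model generation produced at a given intervention intensity."""
    intensity: float   # strength of the truthful-direction intervention (assumed scalar)
    response: str      # text generated at that intensity
    confidence: float  # assumed externally estimated probability of being correct


def select_response(candidates: Sequence[Candidate], threshold: float = 0.5) -> str:
    """Return the most confident response, or refuse if none is confident enough.

    The threshold of 0.5 is an illustrative default, not a value from the paper.
    """
    if not candidates:
        return REFUSAL
    best = max(candidates, key=lambda c: c.confidence)
    return best.response if best.confidence >= threshold else REFUSAL


# Illustrative inputs: the scores below are made up for the example.
confident = [
    Candidate(intensity=0.0, response="Paris", confidence=0.4),
    Candidate(intensity=5.0, response="Paris", confidence=0.8),
    Candidate(intensity=10.0, response="Lyon", confidence=0.3),
]
uncertain = [
    Candidate(intensity=0.0, response="Lyon", confidence=0.2),
    Candidate(intensity=5.0, response="Nice", confidence=0.3),
]

answer = select_response(confident)
abstained = select_response(uncertain)
```

In this sketch the refusal branch is what lets the system express uncertainty: when no intensity yields a sufficiently confident generation, it abstains rather than committing to any of the candidates.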

Anthology ID: 2024.findings-acl.737
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 12388–12400
URL: https://preview.aclanthology.org/fix-sig-urls/2024.findings-acl.737/
DOI: 10.18653/v1/2024.findings-acl.737
Cite (ACL): Farima Fatahi Bayat, Xin Liu, H. Jagadish, and Lu Wang. 2024. Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression. In Findings of the Association for Computational Linguistics: ACL 2024, pages 12388–12400, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression (Fatahi Bayat et al., Findings 2024)
PDF: https://preview.aclanthology.org/fix-sig-urls/2024.findings-acl.737.pdf
