Information-theoretic Distinctions Between Deception and Confusion

Robin Young

Abstract

We propose an information-theoretic formalization of the distinction between two fundamental AI safety failure modes: deceptive alignment and goal drift. While both can lead to systems that appear misaligned, we demonstrate that they represent distinct forms of information divergence occurring at different interfaces in the human-AI system. Deceptive alignment creates entropy between an agent’s true goals and its observable behavior, while goal drift, or confusion, creates entropy between the intended human goal and the agent’s actual goal. Though often observationally equivalent, these failures necessitate different interventions. We present a formal model and an illustrative thought experiment to clarify this distinction. The resulting formal language supports re-examining prominent alignment challenges observed in Large Language Models (LLMs) and suggests novel perspectives on their underlying causes.

Anthology ID: 2025.findings-ijcnlp.15
Volume: Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month: December
Year: 2025
Address: Mumbai, India
Editors: Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
Venue: Findings
Publisher: The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
Pages: 258–268
URL: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.findings-ijcnlp.15/
Cite (ACL): Robin Young. 2025. Information-theoretic Distinctions Between Deception and Confusion. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 258–268, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
Cite (Informal): Information-theoretic Distinctions Between Deception and Confusion (Young, Findings 2025)
PDF: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.findings-ijcnlp.15.pdf