CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models

Mengfan Li; Xuanhua Shi; Yang Deng


Abstract

Theory of Mind (ToM), the ability to attribute mental states to others, is a hallmark of social intelligence. While large language models (LLMs) demonstrate promising performance on standard ToM benchmarks, we observe that they often fail to generalize to complex task-specific scenarios, relying heavily on prompt scaffolding to mimic reasoning. This misalignment between internal knowledge and external behavior raises a fundamental question: Do LLMs truly possess intrinsic cognition, and can they externalize this internal knowledge into stable, high-quality behaviors? To answer this, we introduce CoSToM (Causal-oriented Steering for ToM alignment), a framework that moves from mechanistic interpretation to active intervention. First, we employ causal tracing to map the internal distribution of ToM features, empirically characterizing how internal layers encode fundamental ToM semantics. Building on this insight, we implement a lightweight alignment framework via targeted activation steering within these ToM-critical layers. Experiments demonstrate that CoSToM significantly enhances human-like social reasoning capabilities and downstream dialogue quality.

Anthology ID: 2026.acl-long.421
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 9302–9317
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.421/
Cite (ACL): Mengfan Li, Xuanhua Shi, and Yang Deng. 2026. CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9302–9317, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models (Li et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.421.pdf
Checklist: 2026.acl-long.421.checklist.pdf
