Moral Self-correction is Not An Innate Capability in Language Models
Guangliang Liu, Zimo Qi, Xitong Zhang, Lu Cheng, Kristen Johnson
Abstract
Although there has been growing interest in the self-correction capability of Large Language Models (LLMs), conclusions about its effectiveness vary. Prior research has largely concentrated on intrinsic self-correction; extrinsic self-correction, particularly the interplay between internal knowledge and external feedback, remains underexplored. In this paper, we aim to comprehensively investigate the underlying mechanism of moral self-correction by addressing a fundamental question: is moral self-correction an innate capability of LLMs? Specifically, we conduct: (1) a behavioral analysis of LLMs’ moral sensitivity based on a self-distinguishing task; and (2) a mechanistic analysis of the hidden states to examine how key components of self-correction, such as Chain-of-Thought (CoT) and external feedback, interact to facilitate moral self-correction. Drawing on empirical evidence from both behavioral and mechanistic analyses, we demonstrate that moral self-correction is not an inherent capability of LLMs, as they are neither morally sensitive nor able to effectively incorporate external feedback during the self-correction process.
- Anthology ID:
- 2025.findings-ijcnlp.39
- Volume:
- Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
- Month:
- December
- Year:
- 2025
- Address:
- Mumbai, India
- Editors:
- Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
- Venue:
- Findings
- Publisher:
- The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
- Pages:
- 660–683
- URL:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.findings-ijcnlp.39/
- Cite (ACL):
- Guangliang Liu, Zimo Qi, Xitong Zhang, Lu Cheng, and Kristen Johnson. 2025. Moral Self-correction is Not An Innate Capability in Language Models. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 660–683, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
- Cite (Informal):
- Moral Self-correction is Not An Innate Capability in Language Models (Liu et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.findings-ijcnlp.39.pdf