Moral Self-correction is Not An Innate Capability in Language Models

Guangliang Liu, Zimo Qi, Xitong Zhang, Lu Cheng, Kristen Johnson


Abstract
Although there has been growing interest in the self-correction capability of Large Language Models (LLMs), there are varying conclusions about its effectiveness.Prior research has largely concentrated on intrinsic self-correction, extrinsic self-correction, particularly the interplay between internal knowledge and external feedback, remains underexplored. In this paper, we aim to comprehensively investigate the underlying mechanism of moral self-correction by addressing a fundamental question: is moral self-correction an innate capability of LLMs? Specifically, we conduct: (1) a behavioral analysis of LLMs’ moral sensitivity based on a self-distinguishing task; and (2) a mechanistic analysis of the hidden states to examine how key components of self-correction, such as Chain-of-Thought (CoT) and external feedback, interact to facilitate moral self-correction. Drawing on empirical evidence from both behavioral and mechanistic analyses, we demonstrate that moral self-correction is not an inherent capability of LLMs, as they are neither morally sensitive nor able to effectively incorporate external feedback during the self-correction process.
Anthology ID:
2025.findings-ijcnlp.39
Volume:
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
Venue:
Findings
SIG:
Publisher:
The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
Note:
Pages:
660–683
Language:
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.findings-ijcnlp.39/
DOI:
Bibkey:
Cite (ACL):
Guangliang Liu, Zimo Qi, Xitong Zhang, Lu Cheng, and Kristen Johnson. 2025. Moral Self-correction is Not An Innate Capability in Language Models. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 660–683, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
Cite (Informal):
Moral Self-correction is Not An Innate Capability in Language Models (Liu et al., Findings 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.findings-ijcnlp.39.pdf