Ryo Mitsuhashi
2026
Disentangling the Effects of Unlearning in Measuring Parametric Faithfulness of Chain-of-Thought
Ryo Mitsuhashi | Gaku Morio | Ayana Niwa | Masahiro Kaneko | Kentaro Inui | Terufumi Morishita | Yuta Koreeda | Yasuhiro Sogawa
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Whether Chain-of-Thought (CoT) in large language models (LLMs) faithfully reflects a model's internal reasoning process has been widely debated. Parametric faithfulness is a recently proposed metric that uses unlearning to assess whether a model encodes parametric beliefs corresponding to a reasoning chain. This paper refines the metric by accounting for unintended artifacts of unlearning. We introduce control tasks that unlearn irrelevant knowledge and word-shuffled content, and show that these control tasks yield substantial parametric faithfulness values, suggesting that unlearning itself has a non-negligible effect on the metric. We also find that the control tasks help explain the significant variations in parametric faithfulness observed across model sizes and CoT lengths. We conclude that the effects of unlearning need to be considered when measuring parametric faithfulness.