Chrisanna Cornish


2025

Chain-of-Thought (CoT) ‘reasoning’ promises to enhance the performance and transparency of Large Language Models (LLMs). Models such as DeepSeek-R1 are trained via reinforcement learning to automatically generate CoT explanations in their outputs. The faithfulness of these explanations, i.e. how well they reflect the model's internal reasoning process, has been called into question by recent studies (Chen et al., 2025a; Chua and Evans, 2025). This paper extends previous work by probing DeepSeek-R1 with 445 logical puzzles under zero- and few-shot settings. We find that whilst the model explicitly acknowledges a strong harmful hint in 94.6% of cases, it reports less than 2% of helpful hints. Further analysis reveals implicit unfaithfulness: the model significantly reduces its answer-rechecking behaviour for helpful hints (p < 0.01) despite rarely mentioning them in its CoT, demonstrating a discrepancy between its reported and actual decision process. In line with prior reports for GPT, Claude, Gemini and other models, our results for DeepSeek-R1 raise concerns about the use of CoT as an explainability technique.