Chengzhi Li

2025

Large language models (LLMs) can solve complex multi-step math reasoning problems, but little is known about how these computations are implemented internally. Many recent studies have investigated the mechanisms of LLMs on simple arithmetic tasks (e.g., a+b, a×b), but how LLMs solve mixed arithmetic tasks remains unexplored. This gap limits how well those findings reflect real-world scenarios. In this work, we take a step further to explore how LLMs compute mixed arithmetic expressions. We find that LLMs follow a similar workflow for mixed arithmetic calculations: first parsing the complete expression, then using attention heads to aggregate information to the last token position for result generation, without step-by-step reasoning at the token dimension. However, **for some specific expressions, the model's generation of the final result depends on the generation of intermediate results at the last token position, which is similar to human thinking.** Furthermore, we propose a **C**ausal **E**ffect **D**riven **F**ine-tuning method (CEDF) that adaptively enhances the identified key components used to execute mixed arithmetic calculations, improving LLMs' reasoning ability.

With the widespread application of large language models (LLMs), aligning LLMs with human values has emerged as a critical challenge. For alignment, we expect LLMs to be honest, positive, harmless, etc. LLMs appear to be capable of generating the desired outputs after alignment tuning, such as preference tuning via reinforcement learning from human feedback (RLHF). However, this raises a question: **after alignment, do LLMs genuinely obtain a value distinction between positives and negatives, beyond the generation of positive outputs?** In this work, we start by investigating this question from the token distribution perspective. Our findings reveal that, compared to their unaligned versions, LLMs after alignment exhibit a larger logits gap between positive and negative tokens at each generation step, which suggests that LLMs do obtain a value distinction between positives and negatives after alignment. This also motivates us to achieve alignment by directly constructing such a value distinction, thus reducing the heavy computational cost of training-time alignment. Specifically, we propose a representation editing method that intervenes on the last hidden representation by amplifying the logits difference between positive and negative tokens (defined as anchor words). Experimental results demonstrate that the proposed method not only achieves effective alignment, but also requires fewer computational resources than training-time alignment methods.
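
The logits gap and the representation edit mentioned in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the averaging over anchor tokens, the additive update, and the `alpha` strength parameter are all assumptions made for illustration, and the unembedding matrix is a hand-written 3×2 example rather than a real model.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_rows(matrix: Sequence[Vector], rows: Sequence[int]) -> Vector:
    """Average the selected rows of a matrix (one row per vocabulary token)."""
    dim = len(matrix[0])
    return [sum(matrix[r][i] for r in rows) / len(rows) for i in range(dim)]


def logit_gap(hidden: Vector, unembed: Sequence[Vector],
              pos_ids: Sequence[int], neg_ids: Sequence[int]) -> float:
    """Mean logit of positive anchor tokens minus mean logit of negative ones."""
    pos = mean_rows(unembed, pos_ids)
    neg = mean_rows(unembed, neg_ids)
    return dot(pos, hidden) - dot(neg, hidden)


def edit_hidden(hidden: Vector, unembed: Sequence[Vector],
                pos_ids: Sequence[int], neg_ids: Sequence[int],
                alpha: float) -> Vector:
    """Assumed edit: shift the hidden state along (positive - negative) direction."""
    pos = mean_rows(unembed, pos_ids)
    neg = mean_rows(unembed, neg_ids)
    direction = [p - n for p, n in zip(pos, neg)]
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy setup (hypothetical values): 3 tokens, 2-dimensional hidden state.
unembed = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hidden = [0.2, 0.3]
pos_ids, neg_ids = [0], [1]  # token 0 plays the positive anchor, token 1 the negative

gap_before = logit_gap(hidden, unembed, pos_ids, neg_ids)
edited = edit_hidden(hidden, unembed, pos_ids, neg_ids, alpha=0.5)
gap_after = logit_gap(edited, unembed, pos_ids, neg_ids)
```

Under this additive assumption, the gap grows by `alpha` times the squared norm of the anchor direction, so the edit strength can be tuned without any gradient updates.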

Option Symbol Matters: Investigating and Mitigating Multiple-Choice Option Symbol Bias of Large Language Models
Zhen Yang | Ping Jian | Chengzhi Li
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Multiple-Choice Question Answering (MCQA) is a widely used task in the evaluation of Large Language Models (LLMs). In this work, we reveal that current LLMs’ performance in MCQA can be heavily influenced by the choice of option symbol set, due to option symbol bias. That is, when altering only the option symbols (e.g., A/B/C/D→i/ii/iii/iv), the results can vary sharply, leading to a margin of approximately 10% in accuracy. To uncover the mechanisms behind this, we investigate the internal components of LLMs from a causal perspective. By measuring the causal effects, we identify a small subset of attention heads responsible for the symbol bias. Subsequently, we interpret these key components in a human-understandable way, showing that attention heads with higher causal effects are more likely to focus only on option symbols, while those with lower causal effects tend to distribute their attention across the content of questions and options. This also motivates us to pursue debiasing based on the causal effects. Specifically, to mitigate such bias, we propose a tuning-free, causal-effect-driven debiasing method that intervenes on the activations of the identified components according to their causal effects, with stronger interventions corresponding to higher causal effects. Experimental results demonstrate that the proposed method not only alleviates the aforementioned bias, but also improves the MCQA performance of LLMs.
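
The idea of intervening more strongly on components with higher causal effects can be sketched as follows. This is a hypothetical illustration, not the method from the paper: the top-k selection, the linear scaling rule, the `max_strength` parameter, and the choice to damp (rather than otherwise modify) activations are all assumptions, and the causal-effect scores are made-up numbers.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def intervention_strengths(effects: Dict[Head, float], top_k: int,
                           max_strength: float) -> Dict[Head, float]:
    """Assign each of the top-k heads a strength proportional to its causal effect."""
    ranked = sorted(effects, key=effects.get, reverse=True)[:top_k]
    if not ranked or effects[ranked[0]] <= 0:
        return {}
    peak = effects[ranked[0]]
    return {head: max_strength * effects[head] / peak for head in ranked}


def apply_intervention(activations: Dict[Head, List[float]],
                       strengths: Dict[Head, float]) -> Dict[Head, List[float]]:
    """Damp each selected head's output by its strength; other heads are unchanged."""
    return {
        head: [(1.0 - strengths.get(head, 0.0)) * v for v in values]
        for head, values in activations.items()
    }


# Hypothetical causal-effect scores for three attention heads.
effects = {(0, 1): 0.8, (2, 3): 0.4, (5, 0): 0.05}
strengths = intervention_strengths(effects, top_k=2, max_strength=0.5)

# Toy head outputs; in a real model these would be per-head activation vectors.
activations = {(0, 1): [1.0, 2.0], (2, 3): [4.0], (5, 0): [3.0]}
debiased = apply_intervention(activations, strengths)
```

Because the scaling is applied at inference time, a sketch like this needs no parameter updates, which matches the tuning-free framing in the abstract.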

Co-authors

Yifan Wang 1

Xinyue Zhang 1

Venues

findings 2
naacl 1
