Zhixuan Chu
2026
Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
Ziwen Xu | Chenyan WU | Hengyu Sun | Haiwen Hong | Mengru Wang | Yunzhi Yao | Longtao Huang | Hui Xue | Shumin Deng | Zhixuan Chu | Huajun Chen | Ningyu Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model’s valid-generation manifold. Finally, we introduce a new steering approach guided by this analysis that improves preference while better preserving utility.
Perplexity-Aware Data Scaling Law: Perplexity Landscapes Predict Performance for Continual Pre-training
Lei Liu | Hao Zhu | Xiaoyan Yang | Yue Shen | Zhixuan Chu | Jian Wang | Jinjie Gu | Kui Ren
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Continual Pre-training (CPT) serves as a fundamental approach for adapting foundation models to domain-specific applications. Scaling laws for pre-training define a power-law relationship between dataset size and the test loss of an LLM. However, the marginal gains from simply increasing data for CPT diminish rapidly, yielding suboptimal data utilization and inefficient training. To address this challenge, we propose a novel perplexity-aware data scaling law to establish a predictive relationship between the perplexity landscape of domain-specific data and the test loss. Our approach leverages the pre-trained model’s own perplexity on domain data as a proxy for estimating the knowledge gap, effectively quantifying the informational perplexity landscape of candidate training samples. By fitting this scaling law across diverse perplexity regimes, we enable adaptive selection of high-utility data subsets, prioritizing content that maximizes knowledge absorption while minimizing redundancy and noise. Extensive experiments on both medical and general-domain benchmarks demonstrate that our method consistently identifies near-optimal training subsets, achieving superior performance with significantly reduced data consumption.
Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework
Jiaqi Weng | Han Zheng | Hanyu Zhang | Ej Zhou | Qinqin He | Jialing Tao | Hui Xue | Zhixuan Chu | Xiting Wang
Findings of the Association for Computational Linguistics: ACL 2026
Sparse autoencoders (SAEs) enable interpretability research by decomposing entangled model activations into monosemantic features. However, the circumstances under which SAEs derive the most fine-grained latent features for safety, a low-frequency concept domain, remain unexplored. Two key challenges exist: identifying SAEs with the greatest potential for generating safety domain-specific features, and the prohibitively high cost of detailed feature explanation. In this paper, we propose **Safe-SAIL**, a unified framework for interpreting SAE features in safety-critical domains to advance mechanistic understanding of large language models. Safe-SAIL introduces a pre-explanation evaluation metric to efficiently identify SAEs with strong safety domain-specific interpretability, and reduces interpretation cost by 55% through a segment-level simulation strategy. Building on Safe-SAIL, we train a comprehensive suite of SAEs with human-readable explanations and systematic evaluations for 1,758 safety-related features spanning four domains: pornography, politics, violence, and terror. Using this resource, we conduct empirical analyses and provide insights into the effectiveness of Safe-SAIL for risk feature identification and into how safety-critical entities and concepts are encoded across model layers. All models, explanations, and tools are publicly released in an open-source toolkit at https://anonymous.4open.science/r/Safe-SAIL/.
2025
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
Licheng Pan | Yongqi Tong | Xin Zhang | Xiaolu Zhang | Jun Zhou | Zhixuan Chu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they often refuse to answer legitimate queries, a phenomenon known as overrefusal. Overrefusal typically stems from over-conservative safety alignment, causing models to treat many reasonable prompts as potentially risky. To systematically understand this issue, we probe and leverage the models’ safety decision boundaries to analyze and mitigate overrefusal. Our findings reveal that overrefusal is closely tied to misalignment at these boundary regions, where models struggle to distinguish subtle differences between benign and harmful content. Building on these insights, we present **RASS**, an automated framework for prompt generation and selection that strategically targets overrefusal prompts near the safety boundary. By harnessing steering vectors in the representation space, **RASS** efficiently identifies and curates boundary-aligned prompts, enabling more effective and targeted mitigation of overrefusal. This approach not only provides a more precise and interpretable view of model safety decisions but also seamlessly extends to multilingual scenarios. We explore the safety decision boundaries of various LLMs and construct the **MORBench** evaluation set to facilitate robust assessment of model safety and helpfulness across multiple languages. Code and datasets are available at https://github.com/Master-PLC/RASS.
2024
Self-Para-Consistency: Improving Reasoning Tasks at Low Cost for Large Language Models
Wenqing Chen | Weicheng Wang | Zhixuan Chu | Kui Ren | Zibin Zheng | Zhichao Lu
Findings of the Association for Computational Linguistics: ACL 2024
Recently, the self-consistency decoding strategy has shown the ability to improve performance for complex reasoning tasks with large language models (LLMs). However, the costs may be high because the sampling process of the strategy generates some low-probability text, resulting in low-quality reasoning paths. As a consequence, it requires a relatively large sampling number to obtain good aggregation performance. In this paper, we propose an alternative strategy, self-para-consistency. It first generates multiple paraphrases for each test question, then generates reasoning paths for the original and all the paraphrased questions based on greedy decoding, and finally selects the most consistent answer. Since all the candidate paths have relatively high probabilities, the sampling number could be much smaller than that of the self-consistency strategy. Extensive experiments on complex reasoning datasets demonstrate the effectiveness of our method in reducing the sampling number.
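The three steps named in this abstract (paraphrase, greedy decoding, consistency-based selection) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `paraphrase` and `greedy_answer` callables, and the toy stand-ins are all hypothetical, and "most consistent answer" is assumed here to mean a simple majority vote over final answers.

```python
from collections import Counter
from typing import Callable, List


def self_para_consistency(
    question: str,
    paraphrase: Callable[[str, int], List[str]],  # hypothetical: returns n paraphrases
    greedy_answer: Callable[[str], str],          # hypothetical: greedy-decoded final answer
    num_paraphrases: int = 3,
) -> str:
    """Answer the original question and its paraphrases, then majority-vote."""
    # Step 1: build the question pool (original plus paraphrases).
    questions = [question] + paraphrase(question, num_paraphrases)
    # Step 2: one greedy-decoded answer per question, so no sampling is involved.
    answers = [greedy_answer(q) for q in questions]
    # Step 3: assumed aggregation rule, the most frequent answer wins.
    return Counter(answers).most_common(1)[0][0]


# Toy stand-ins for a real model, for illustration only.
def toy_paraphrase(question: str, n: int) -> List[str]:
    return [f"{question} (v{i})" for i in range(1, n + 1)]


def toy_greedy_answer(question: str) -> str:
    # Pretend one paraphrase leads the model to a wrong answer.
    return "41" if "(v2)" in question else "42"


result = self_para_consistency("What is 6 * 7?", toy_paraphrase, toy_greedy_answer)
```

With the toy stand-ins, three of the four questions yield "42", so the vote returns "42". In practice both callables would wrap calls to an LLM.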
2022
Incorporating Causal Analysis into Diversified and Logical Response Generation
Jiayi Liu | Wei Wei | Zhixuan Chu | Xing Gao | Ji Zhang | Tan Yan | Yulin Kang
Proceedings of the 29th International Conference on Computational Linguistics
Although the Conditional Variational Auto-Encoder (CVAE) model can generate more diversified responses than the traditional Seq2Seq model, the responses often have low relevance to the input words or are illogical with respect to the question. A causal analysis is carried out to study the underlying reasons, and a methodology for searching for the mediators and mitigating the confounding bias in dialogues is provided. Specifically, we propose to predict the mediators to preserve relevant information and auto-regressively incorporate the mediators into the generation process. In addition, a dynamic topic graph guided conditional variational auto-encoder (TGG-CVAE) model is utilized to complement the semantic space and reduce the confounding bias in responses. Extensive experiments demonstrate that the proposed model is able to generate both relevant and informative responses, and outperforms the state of the art in terms of automatic metrics and human evaluations.
Co-authors
- Kui Ren 2
- Hui Xue 2
- Huajun Chen 1
- Wenqing Chen 1
- Shumin Deng 1
- Xing Gao 1
- Jinjie Gu 1
- Qinqin He 1
- Haiwen Hong 1
- Longtao Huang 1
- Yulin Kang 1
- Jiayi Liu 1
- Lei Liu 1
- Zhichao Lu 1
- Licheng Pan 1
- Yue Shen 1
- Hengyu Sun 1
- Jialing Tao 1
- Yongqi Tong 1
- Chenyan WU 1
- Mengru Wang 1
- Jian Wang 1
- Xiting Wang 1
- Weicheng Wang 1
- Wei Wei 1
- Jiaqi Weng 1
- Ziwen Xu 1
- Tan Yan 1
- Xiaoyan Yang 1
- Yunzhi Yao 1
- Ji Zhang 1
- Ningyu Zhang 1
- Xin Zhang 1
- Xiaolu Zhang 1
- Hanyu Zhang 1
- Han Zheng 1
- Zibin Zheng 1
- Jun Zhou 1
- Ej Zhou 1
- Hao Zhu 1