Zhuowen Han
2026
Neuronal Insights into LLM Attacks: Targeted Neuron Tuning for Precise and Robust Vulnerability Patching
Dan Shi | Renren Jin | Zhuowen Han | Yuqi Ren | Xinwei Wu | Zhigen Li | Deyi Xiong
Findings of the Association for Computational Linguistics: ACL 2026
Despite recent advances in safety alignment, large language models (LLMs) remain highly susceptible to adversarial attacks, while the internal mechanisms behind such vulnerabilities are still poorly understood. Existing gradient-based attribution methods offer valuable interpretability for analyzing information storage and processing in LLMs. However, they are inapplicable to adversarial attacks, which typically occur in open-ended generation settings without fixed ground-truth outputs. To address these challenges, we propose a novel similarity-based gradient attribution method to identify key neurons sensitive to adversarial behaviors in open-ended generation tasks. The detected neurons, termed targeted neurons, play a critical role in safety training. Building on this neuron-level perspective, we uncover two key neuronal patterns: (i) universal neurons that are consistently exploited across multiple attack strategies, and (ii) interference neurons that hinder safety improvements when fine-tuned indiscriminately, providing mechanistic insights into the interpretability of adversarial vulnerabilities. Inspired by these findings, we propose a neuron-level defense strategy, Targeted Neuron Tuning (TNT), which selectively fine-tunes the identified targeted neurons for specific attacks. Experimental evaluations across multiple LLM architectures and scales demonstrate that TNT substantially improves model robustness against a wide range of jailbreak attacks, achieving safe rates exceeding 90% and even approaching 100%, while preserving general task performance, enabling precise and robust safety interventions. Warning: This paper contains example data that may be harmful.
ERRV: Eliciting Efficient Reasoning through Reasoning Vectors for Policy Optimization in Large Language Models
Zhuowen Han | Lei Yang | Renren Jin | Dan Shi | Chenxi Sun | Deyi Xiong
Findings of the Association for Computational Linguistics: ACL 2026
Recently, large reasoning models have achieved impressive performance, but their lengthy reasoning processes incur substantial inference overhead. To mitigate this issue, we propose the concept of reasoning vectors, representations extracted from the model’s hidden states, which can guide the model towards generating more concise and accurate responses. Building upon this, we present ERRV, a training framework that elicits efficient reasoning through reasoning vectors, which enables the model to generate high-quality responses during reinforcement learning. By performing targeted policy optimization on both accuracy and length objectives, ERRV effectively activates the model’s latent capability for efficient reasoning. Our experiments demonstrate that after training with ERRV, the model achieves approximately 30% reduction in reasoning length while maintaining stable accuracy, without guidance from the reasoning vector during inference. This establishes a trade-off between efficiency and performance. Furthermore, we identify key properties of reasoning vectors: robustness, characterized by high similarity before and after training, and generalizability, demonstrating applicability across base models, distilled models, RL-trained models, parameter-merged models, and mixed-thought models. These properties collectively guarantee the reliability and broad applicability of our approach.
Revisiting Entropy in Reinforcement Learning for Large Reasoning Models
Renren Jin | Pengzhi Gao | Yuqi Ren | Zhuowen Han | Tongxuan Zhang | Wuwei Huang | Wei Liu | Jian Luan | Deyi Xiong
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement learning with verifiable rewards (RLVR) has emerged as a prominent paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, the entropy of LLMs usually collapses during RLVR training, leading to premature convergence to suboptimal local minima and hindering further performance improvement. Although various approaches have been proposed to mitigate entropy collapse, a comprehensive study of entropy in RLVR remains lacking. To bridge this gap, we conduct extensive experiments to investigate the entropy dynamics of LLMs trained with RLVR and analyze how model entropy correlates with response diversity, calibration, and performance across various benchmarks. Our results identify three key factors that influence entropy: the clipping thresholds in the optimization objective, the number of off-policy updates, and the diversity of the training data. Furthermore, through both theoretical analysis and empirical validation, we demonstrate that tokens with positive advantages are the primary drivers of entropy collapse. Motivated by this insight, we propose Positive-Advantage Reweighting, a simple yet effective approach that regulates model entropy by adjusting the loss weights assigned to tokens with positive advantages during RLVR training, while maintaining competitive performance.
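The reweighting idea in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact objective, so everything here is an assumption made for illustration: the function name `reweighted_pg_loss`, the single scalar `pos_weight`, and the plain token-level policy-gradient form without clipping are all hypothetical and are not the authors' implementation.

```python
from typing import Sequence


def reweighted_pg_loss(
    logprobs: Sequence[float],
    advantages: Sequence[float],
    pos_weight: float = 0.5,
) -> float:
    """Mean token-level policy-gradient loss with rescaled positive-advantage tokens.

    Hypothetical sketch: tokens whose advantage is positive have their loss
    term multiplied by `pos_weight`; all other tokens keep weight 1.0.
    """
    if len(logprobs) != len(advantages):
        raise ValueError("logprobs and advantages must have the same length")
    if not logprobs:
        return 0.0

    total = 0.0
    for logprob, advantage in zip(logprobs, advantages):
        # Assumed rule: only positive-advantage tokens are reweighted.
        weight = pos_weight if advantage > 0 else 1.0
        # Standard policy-gradient term: minimize -(advantage * log-probability).
        total += -weight * advantage * logprob
    return total / len(logprobs)


# Two tokens: one with a positive advantage, one with a negative advantage.
example_loss = reweighted_pg_loss([-1.0, -2.0], [1.0, -1.0], pos_weight=0.5)
```

Under this assumed form, setting `pos_weight` to 1.0 recovers the unweighted loss, while values below 1.0 shrink the contribution of positive-advantage tokens, which the abstract identifies as the primary drivers of entropy collapse.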
From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan
Lei Yang | Leiyu Pan | Bojian Xiong | Renren Jin | Shaowei Zhang | Yue Chen | Ling Shi | Jiang Zhou | Junru Wu | Zhen Wang | Jianxiang Peng | Juesi Xiao | Tianyu Dong | Zhuowen Han | Zhuo Chen | Yuqi Ren | Deyi Xiong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre-training. We construct a 72 GB high-quality Tibetan corpus, the largest to date, and adapt Qwen2.5-7B through balanced multilingual continual pre-training with Tibetan, Chinese, and English, followed by multilingual instruction tuning. To further scale capacity efficiently, we extend the dense model to a 50B-A10B Mixture-of-Experts architecture. Due to the absence of standardized Tibetan benchmarks, we build multiple evaluation datasets via high-quality translation and human verification. Experimental results show that both dense and MoE models consistently outperform existing open-source and Tibetan-focused models of similar scale across diverse tasks. Our work advances Tibetan-centric LLM research and provides transferable insights for extending LLMs to other low-resource languages. We will release the model weights, evaluation benchmarks, and detailed data processing documentation in the follow-up.
Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models
Dan Shi | Zhuowen Han | Simon Ostermann | Renren Jin | Josef van Genabith | Deyi Xiong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement learning (RL)-based post-training often improves the reasoning performance of large language models (LLMs) beyond the training domain, while supervised fine-tuning (SFT) frequently leads to forgetting of general capabilities. However, the mechanisms underlying this contrast remain unclear. To bridge this gap, we present a feature-level mechanistic analysis methodology to probe RL generalization using a controlled experimental setup, where RL- and SFT-tuned models are trained from the same base model on identical data. Leveraging our interpretability framework, we align internal activations across models within a shared feature space and analyze how features evolve during post-training. We find that SFT rapidly introduces many highly specialized features that stabilize early in training, whereas RL induces more restrained and continually evolving feature changes that largely preserve base models’ representations. Focusing on samples where RL succeeds but the base model fails, we identify a compact, task-agnostic set of features that directly mediate generalization across diverse tasks. Feature-level interventions confirm their causal role: disabling these features significantly degrades RL models’ generalization performance, while amplifying them improves base models’ performance. The code is available at https://github.com/danshi777/RL-generalization.
2025
Towards a Unified Paradigm of Concept Editing in Large Language Models
Zhuowen Han | Xinwei Wu | Dan Shi | Renren Jin | Deyi Xiong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Concept editing aims to control specific concepts in large language models (LLMs) and is an emerging subfield of model editing. Despite the emergence of various editing methods in recent years, there remains a lack of rigorous theoretical analysis and a unified perspective to systematically understand and compare these methods. To address this gap, we propose a unified paradigm for concept editing methods, in which all forms of conceptual injection are aligned at the neuron level. We study four representative concept editing methods: Neuron Editing (NE), Supervised Fine-tuning (SFT), Sparse Autoencoder (SAE), and Steering Vector (SV). We then categorize them into two classes based on their mode of conceptual information injection: indirect (NE, SFT) and direct (SAE, SV). We evaluate these methods along four dimensions: editing reliability, output generalization, neuron-level consistency, and mathematical formalization. Experiments show that SAE achieves the best editing reliability. In output generalization, SAE captures features closer to human-understood concepts, while NE tends to locate text patterns rather than true semantics. Neuron-level analysis reveals that direct methods share high neuron overlap, as do indirect methods, indicating methodological commonality within each category. Our unified paradigm offers a clear framework and valuable insights for advancing interpretability and controlled generation in LLMs.
Praetor: A Fine-Grained Generative LLM Evaluator with Instance-Level Customizable Evaluation Criteria
Yongqi Leng | Renren Jin | Yue Chen | Zhuowen Han | Ling Shi | Jianxiang Peng | Lei Yang | Juesi Xiao | Deyi Xiong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the increasing capability of large language models (LLMs), LLM-as-a-judge has emerged as a new evaluation paradigm. Compared with traditional automatic and manual evaluation, LLM evaluators exhibit better interpretability and efficiency. Despite this, existing LLM evaluators suffer from limited use scenarios and poor flexibility. To mitigate these issues, we propose Praetor, a fine-grained generative LLM evaluator with instance-level customizable evaluation criteria. To train Praetor, we curate a large-scale dataset guided by a hierarchical guideline covering a wide range of tasks and instance-level evaluation criteria. We train Praetor on this dataset in a multi-task learning fashion, which enables it to evaluate LLMs via either pointwise grading or pairwise comparison and to support two languages simultaneously, with high flexibility in setting evaluation criteria. Extensive experiments demonstrate that Praetor outperforms previous LLM evaluators and instruction-tuned LLMs on multiple benchmarks, setting new SOTA results. It also exhibits the potential for generating critiques as scalable feedback to further improve LLMs. Our model and related resources are released at https://github.com/tjunlp-lab/Praetor.
Co-authors
- Renren Jin 7
- Deyi Xiong (德意 熊) 7
- Dan Shi 4
- Yuqi Ren 3
- Yue Chen 2
- Jianxiang Peng 2
- Ling Shi 2
- Xinwei Wu 2
- Juesi Xiao 2
- Lei Yang 2
- Zhuo Chen 1
- Tianyu Dong 1
- Pengzhi Gao 1
- Wuwei Huang 1
- Yongqi Leng 1
- Zhigen Li 1
- Wei Liu 1
- Jian Luan 1
- Simon Ostermann 1
- Leiyu Pan 1
- Chenxi Sun 1
- Zhen Wang 1
- Junru Wu 1
- Bojian Xiong 1
- Lei Yang 1
- Shaowei Zhang 1
- Tongxuan Zhang 1
- Jiang Zhou 1
- Josef van Genabith 1