Qiaozhi He
2026
Cross-layer Attention Sharing for Pre-trained Large Language Models
Yongyu Mu | Yuzhang Wu | Yuchun Fan | Chenglong Wang | Hengyu Li | Jiali Zeng | Qiaozhi He | Murun Yang | Fandong Meng | Jie Zhou | Tong Xiao | Jingbo Zhu
Transactions of the Association for Computational Linguistics, Volume 14
To enhance the efficiency of the attention mechanism within large language models (LLMs), previous works primarily compress the Key-Value cache or group attention heads, while largely overlooking redundancy between layers. Our comprehensive analyses across various LLMs show that highly similar attention patterns persist within most layers. It is intuitive to reduce this redundancy by sharing attention weights across layers. However, further analysis reveals two challenges: (1) directly sharing the weight matrix without carefully rearranging the attention heads proves to be ineffective; (2) shallow layers are vulnerable to small deviations in attention weights. Driven by these insights, we introduce LiSA, a lightweight substitute for self-attention in well-trained LLMs. LiSA employs tiny feed-forward networks to align attention heads between adjacent layers and low-rank matrices to approximate differences in layer-wise attention weights. Evaluations encompassing 13 typical benchmarks demonstrate that LiSA maintains high response quality in terms of accuracy and perplexity while reducing redundant attention calculations within 53% to 84% of the total layers. Our implementations of LiSA achieve a 6× compression of Q and K matrices within the attention mechanism, with maximum throughput improvements of 19.5%, 32.3%, and 40.1% for LLaMA3-8B, LLaMA2-7B, and LLaMA2-13B, respectively. Our code is available at https://github.com/takagi97/lisa.
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
Yifu Huo | Chenglong Wang | Ziming Zhu | Shunjie Xing | Peinan Feng | Tongran Liu | Qiaozhi He | Tian Hua Zhou | Changxiaojia | JingBo Zhu | Zhengtao Yu | Tong Xiao
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement learning (RL) has emerged as a promising paradigm for training reasoning-oriented models by leveraging rule-based reward signals. However, RL training typically tends to improve single-sample success rates (i.e., Pass@1) while offering limited exploration of diverse reasoning trajectories, which is crucial for multi-sample performance (i.e., Pass@k). Our preliminary analysis reveals that this limitation stems from a fundamental squeezing effect, whereby probability mass is excessively concentrated on a narrow subset of high-reward trajectories, restricting genuine exploration and constraining attainable performance under RL training. To address this issue, in this work, we propose Steering Probability Squeezing (SPS), a training paradigm that interleaves conventional RL with inverse reinforcement learning (IRL). SPS treats on-policy rollouts as demonstrations and employs IRL to explicitly reshape the induced trajectory distribution, thereby enhancing exploration without introducing external supervision. Experiments on five commonly used reasoning benchmarks demonstrate that SPS can enable better exploration and improve Pass@k. Beyond algorithmic contributions, we provide an analysis of RL learning dynamics and identify an empirical upper bound on Pass@k, shedding light on intrinsic exploration limits in RL-based reasoning models. Our findings suggest that alternating between RL and IRL offers an effective pathway toward extending the exploration capacity of reasoning-oriented large language models.
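The abstract above contrasts Pass@1 with Pass@k but does not say how either is computed. As a minimal sketch, assuming the widely used unbiased estimator (draw n samples per problem, count c correct, and estimate the chance that at least one of k drawn samples is correct), the metric can be written as follows. The function name `pass_at_k` is illustrative and is not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled answers, c of which are correct.

    Assumption: this is the standard combinatorial estimator
    1 - C(n - c, k) / C(n, k); the abstract does not specify
    which estimator the authors use.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset holds a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With 3 correct answers out of 10, Pass@1 is the plain success rate,
    # while Pass@5 rewards having any correct answer among 5 draws.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

Under this estimator, Pass@1 reduces to c/n, which is why a policy can raise Pass@1 while leaving Pass@k for larger k largely unchanged when its samples collapse onto a few trajectories.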
SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams
Chenglong Wang | Canjia Li | Xingzhao Zhu | Yifu Huo | Huiyu Wang | Weixiong Lin | Yun Yang | Qiaozhi He | Tian Hua Zhou | Changxiaojia | JingBo Zhu | Tong Xiao
Findings of the Association for Computational Linguistics: ACL 2026
Due to the dynamically evolving nature of real-world query streams, relevance models struggle to generalize to practical search scenarios. Self-evolution techniques offer a sophisticated solution. However, in large-scale industrial settings with massive query streams, these techniques face two challenges: (1) informative samples are often sparse and difficult to identify, and (2) pseudo-labels generated by the current model could be unreliable. To address these challenges, in this work, we propose a Self-Evolving Relevance Model approach (SERM), which comprises two complementary multi-agent modules: a multi-agent sample miner, designed to detect distributional shifts and identify informative training samples, and a multi-agent relevance annotator, which provides reliable labels through a two-level agreement framework. We evaluated SERM on a large-scale industrial platform, which serves billions of user requests daily. Experimental results demonstrate that SERM can achieve significant performance gains through iterative self-evolution, as validated by extensive offline multilingual evaluations and online testing.
2025
A Knowledge Editing Method Based on Associated Neuron Identification
Yuzhang Wu | Yongyu Mu | Chenglong Wang | Qiaozhi He | Tong Xiao | Anxiang Ma | Chunliang Zhang | JingBo Zhu
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)
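In recent years, large language models have shown a strong ability to store and retrieve knowledge from their training corpora. Their reliability, however, is correspondingly vulnerable to erroneous information in those corpora, which leads to problems such as outdated information and incorrect responses. Knowledge editing methods based on neuron identification precisely modify a model's internal knowledge by identifying and fine-tuning the knowledge neurons associated with the target knowledge. However, we find that the form in which knowledge is expressed significantly affects which knowledge neurons are identified: for example, the neuron sets that existing identification methods find for different expressions of the same knowledge have an average overlap rate of only 21.86%. As a result, editing knowledge through a single expression cannot cover all neurons related to that knowledge, so existing knowledge editing methods tend to have poor robustness. To identify all neurons related to a piece of knowledge comprehensively and accurately, we design a Lightweight Associated Neuron Detector (LAND). LAND learns the differences between the knowledge neuron sets identified from different expressions of the same knowledge and, during identification, automatically fills in the knowledge neurons that were missed because of differences in expression. Experimental results show that LAND raises the average overlap rate of knowledge neurons identified from differently expressed texts to above 96%, and improves the knowledge editing success rate across different sentence patterns by up to 10.83 percentage points over baseline methods.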
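The abstract reports an average overlap rate between knowledge neuron sets but does not define the measure. As a rough sketch, assuming the overlap rate is the Jaccard index (intersection over union) averaged over all pairs of expressions, it could be computed as below. The `(layer, index)` neuron identifiers and both function names are hypothetical and are used only for illustration.

```python
from itertools import combinations


def overlap_rate(a: set, b: set) -> float:
    """Jaccard overlap of two neuron sets (assumed definition)."""
    union = a | b
    # Two empty sets are treated as fully overlapping.
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_overlap(neuron_sets: list) -> float:
    """Average the overlap rate over every pair of neuron sets."""
    pairs = list(combinations(neuron_sets, 2))
    if not pairs:
        raise ValueError("need at least two neuron sets")
    return sum(overlap_rate(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical neurons found for two phrasings of the same fact,
    # written as (layer, index) pairs.
    phrasing_a = {(0, 1), (0, 2), (1, 5)}
    phrasing_b = {(0, 2), (1, 5), (2, 7)}
    print(mean_pairwise_overlap([phrasing_a, phrasing_b]))
```

A low value under this measure means that editing the neurons found from one phrasing leaves most neurons found from another phrasing untouched, which is the robustness gap the abstract describes.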