Hanting Chen
2026
Multi-Granularity Semantic Revision for Large Language Model Distillation
Xiaoyu Liu | Yun Zhang | Wei Li | Simiao Li | Xudong Huang | Hanting Chen | Yehui Tang | Jie Hu | Zhiwei Xiong | Yunhe Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Knowledge distillation is crucial for compressing Large Language Models (LLMs), enabling smaller student models to learn from larger teacher models. However, existing LLM distillation methods rely too heavily on student-generated outputs, which may introduce generation errors and misguide the distillation process. Moreover, existing distillation loss functions struggle to align the most informative parts of the output because of the complex output distributions of LLMs. To address these problems, we propose a multi-granularity semantic revision method for LLM distillation. At the sequence level, we propose a sequence correction and re-generation (SCRG) strategy. SCRG identifies error tokens by calculating the semantic cognitive difference between teacher and student outputs, corrects them using teacher-generated tokens, and re-generates the sequence to minimize errors. At the token level, we design a distribution adaptive clipping Kullback-Leibler (DAC-KL) loss, which uses a learnable sub-network to focus on semantically dense areas of the teacher’s output, reducing the impact of redundant information. At the span level, we utilize span priors to compute probability correlations within sequences, ensuring consistency between teacher and student outputs to enhance semantic information transfer. Extensive experiments on models ranging from 0.1B to 13B parameters demonstrate the effectiveness of our approach compared to existing methods.
MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers
Linrui Ma | Chun Hei Lo | Xinyu Wang | Peng Lu | Xihao Yuan | Hanting Chen | Kai Han | Xinghao Chen | Chengjun Zhan | Hanlin xu | Yichun Yin | Lifeng Shang | Feng Wen | Boxing Chen | Yufei Cui
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The quadratic computational cost of traditional attention mechanisms poses a major bottleneck to the scalability and practical deployment of large language models (LLMs), particularly in long-context scenarios. To improve efficiency, existing approaches often enforce rigid structural constraints such as local attention windows. However, these strategies typically lead to substantial performance degradation on tasks requiring precise long-range recall. In this work, we propose MATCH, a scalable and efficient framework that augments sparsified attention mechanisms with dynamically integrated in-context information through an efficient retrieval system. Empirical results show that MATCH significantly improves the performance of sparse-attention models on both synthetic and real-world natural-language tasks. These findings highlight the versatility of MATCH as a general approach for enhancing in-context retrieval capabilities while maintaining the efficiency benefits of sparse attention architectures.