De-Chuan Zhan


2026

As the scale of Large Language Models (LLMs) continues to grow rapidly, the cost of training and inference has risen sharply, limiting their application in resource-constrained scenarios. To address this challenge, model pruning has been widely used to reduce computational complexity. Among various pruning strategies, block-wise pruning has gained popularity because removing entire blocks of parameters directly accelerates computation. However, existing methods often rely on hard labels from calibration datasets and neglect the cumulative effects of pruning on subsequent blocks. To address this, we propose two complementary techniques: the Logit Disruption Score (LDS), a novel block importance criterion that measures the impact of pruning by computing the cosine similarity between the logits of the original and pruned models, focusing on the most informative logit dimensions to better preserve the model’s core capabilities; and Activation Statistics Correction (ASC), an affine transformation mechanism that aligns the mean and variance of activations in the pruned model with those of the original model, effectively mitigating the distribution shift caused by block removal and improving the information flow in subsequent blocks. Experiments across multiple datasets show that our approach reduces reliance on calibration data and improves generalization, achieving competitive results with existing methods.
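The abstract gives no formulas, but the two ideas can be sketched concretely. The snippet below is a minimal, hypothetical illustration (not the paper's implementation): LDS is approximated as one minus the cosine similarity between original and pruned logits restricted to the original model's top-k dimensions, and ASC as a per-feature affine map matching the pruned activations' mean and variance to recorded statistics of the original model. All function names, the top-k selection rule, and the epsilon constants are assumptions.

```python
import numpy as np

def logit_disruption_score(orig_logits, pruned_logits, top_k=64):
    """Sketch of a Logit Disruption Score: compare original vs. pruned
    logits on the k most informative dimensions (here assumed to be the
    original model's largest logits) via cosine similarity.
    Returns 0 when the restricted logits match exactly; larger values
    indicate greater disruption from pruning."""
    # Indices of the top-k logit dimensions of the original model, per sample.
    idx = np.argsort(orig_logits, axis=-1)[:, -top_k:]
    o = np.take_along_axis(orig_logits, idx, axis=-1)
    p = np.take_along_axis(pruned_logits, idx, axis=-1)
    # Per-sample cosine similarity, averaged over the batch.
    cos = np.sum(o * p, axis=-1) / (
        np.linalg.norm(o, axis=-1) * np.linalg.norm(p, axis=-1) + 1e-8)
    return 1.0 - cos.mean()

def activation_statistics_correction(pruned_act, orig_mean, orig_std):
    """Sketch of ASC: an affine transform that standardizes the pruned
    model's activations per feature and rescales them to the mean and
    standard deviation recorded from the original model."""
    mu = pruned_act.mean(axis=0)
    sigma = pruned_act.std(axis=0) + 1e-8
    return (pruned_act - mu) / sigma * orig_std + orig_mean
```

In this reading, blocks whose removal yields the lowest disruption score would be pruned first, and ASC would be applied to the activations feeding each block that follows a removed one.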

2025

Knowledge distillation (KD) is a widely used approach for BERT compression, where a larger BERT model serves as a teacher to transfer knowledge to a smaller student model. Prior work has found that distilling from a larger, better-performing BERT can yield a worse student than distilling from a smaller BERT. In this paper, we investigate the limitations of existing KD methods for larger BERT models. Through Canonical Correlation Analysis, we identify that these methods fail to fully exploit the potential advantages of larger teachers. To address this, we propose an improved distillation approach that effectively enhances knowledge transfer. Comprehensive experiments demonstrate the effectiveness of our method in enabling larger BERT models to distill knowledge more efficiently.

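The abstract does not specify how Canonical Correlation Analysis is applied; one standard way to probe how much of a teacher's representation a student captures is the mean canonical correlation between their hidden states on the same inputs. The sketch below is a generic SVD-based CCA computation, not the paper's exact procedure; the function name and epsilon threshold are assumptions.

```python
import numpy as np

def cca_similarity(X, Y, eps=1e-6):
    """Mean canonical correlation between two views X and Y
    (each: samples x features), e.g. teacher vs. student hidden states.
    Returns a value in [0, 1]; 1 means the subspaces fully align."""
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Whiten each view via SVD, dropping near-zero directions.
    Ux, Sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Y, full_matrices=False)
    Ux = Ux[:, Sx > eps]
    Uy = Uy[:, Sy > eps]
    # Singular values of Ux^T Uy are the canonical correlations.
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(rho.mean())
```

Under this kind of analysis, a student that ignores much of a larger teacher's representation would show low canonical correlation with it despite a small distillation loss.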
Chain-of-Thought (CoT) reasoning has significantly advanced Large Language Models (LLMs) in solving complex tasks. However, its autoregressive paradigm leads to significant computational overhead, hindering its deployment in latency-sensitive applications. To address this, we propose **DART** (**D**istilling **A**utoregressive **R**easoning to Silent **T**hought), a self-distillation framework that enables LLMs to replace autoregressive CoT with non-autoregressive Silent Thought (ST). Specifically, DART introduces two training pathways: the CoT pathway for traditional reasoning and the ST pathway for generating answers directly from a few ST tokens. The ST pathway utilizes a lightweight Reasoning Evolvement Module (REM) to align its hidden states with the CoT pathway, enabling the ST tokens to evolve into informative embeddings. During inference, only the ST pathway is activated, leveraging evolving ST tokens to deliver the answer directly. Extensive experimental results demonstrate that DART offers significant performance gains compared with existing non-autoregressive baselines without extra inference latency, serving as a feasible alternative for efficient reasoning.
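The abstract describes the Reasoning Evolvement Module (REM) only as a lightweight module that aligns the ST pathway's hidden states with the CoT pathway's. As a rough illustration, the training signal for that alignment might resemble the regression loss below, where the REM is assumed, purely for the sketch, to be a single affine map; the function name, the affine parameterization, and the MSE objective are all assumptions, not the paper's specification.

```python
import numpy as np

def rem_alignment_loss(st_hidden, cot_hidden, W, b):
    """Sketch of an ST-pathway alignment objective: a hypothetical
    affine REM (W, b) projects Silent-Thought hidden states, which are
    then regressed onto the CoT pathway's hidden states via MSE."""
    projected = st_hidden @ W + b  # REM forward pass (assumed affine)
    return float(np.mean((projected - cot_hidden) ** 2))
```

During self-distillation, minimizing such a loss would push the ST tokens' embeddings toward carrying the information the autoregressive CoT trace would otherwise compute step by step; at inference, only the ST pathway runs, so no CoT tokens are generated.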