Pingwei Sun
2026
WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling
Jiacheng Li | Jianchao Tan | Zhidong Yang | Pingwei Sun | Feiye Huo | Jiayu Qin | Xiangyu Zhang | Maoxin He | Guangming Tan | Weile Jia | Xunliang Cai | Tong Zhao
Findings of the Association for Computational Linguistics: ACL 2026
The Transformer architecture has come to dominate the LLM field. Recent advances in training optimization for Transformer-based large language models (LLMs) focus primarily on architectural modifications or optimizer adjustments. However, these approaches do not systematically optimize weight patterns during training, where a weight pattern is the distribution and relative magnitudes of the weight parameters in a neural network. To address this gap, we propose WISCA, a weight scaling method that improves training efficiency and model quality by strategically improving neural network weight patterns without changing the network structure. By rescaling weights while preserving model outputs, WISCA indirectly optimizes the model’s training trajectory. Experiments demonstrate that WISCA significantly improves convergence quality (measured by generalization capability and loss reduction), particularly in LLMs with Grouped Query Attention (GQA) architectures and in LoRA fine-tuning tasks. Empirical results show a 5.6% average improvement on zero-shot validation tasks and a 2.12% average reduction in training perplexity across multiple architectures.
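The idea of rescaling weights while preserving model outputs can be illustrated with a toy example. The sketch below is not the WISCA procedure itself; it assumes a single attention head with hypothetical projections `w_q` and `w_k` and an arbitrary scale factor `alpha`, and only shows that multiplying one projection by `alpha` and dividing the other by `alpha` leaves the attention logits unchanged.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def attention_logits(x, w_q, w_k):
    """Unnormalized attention logits (x W_q)(x W_k)^T for one head."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    return matmul(q, transpose(k))

def rescale_pair(w_q, w_k, alpha):
    """Scale W_q by alpha and W_k by 1/alpha (hypothetical helper).

    The logits depend only on the product of the two projections,
    so the scale factors cancel and the output is preserved.
    """
    scaled_q = [[v * alpha for v in row] for row in w_q]
    scaled_k = [[v / alpha for v in row] for row in w_k]
    return scaled_q, scaled_k

random.seed(0)
x = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]    # 3 tokens, width 4
w_q = [[random.gauss(0, 1) for _ in range(2)] for _ in range(4)]  # head dim 2
w_k = [[random.gauss(0, 1) for _ in range(2)] for _ in range(4)]

before = attention_logits(x, w_q, w_k)
new_q, new_k = rescale_pair(w_q, w_k, alpha=2.0)
after = attention_logits(x, new_q, new_k)

# Largest absolute change in any logit after rescaling.
max_diff = max(abs(a - b) for ra, rb in zip(before, after) for a, b in zip(ra, rb))
```

The weights themselves change (the relative magnitudes of `w_q` and `w_k` shift), while the function computed by the head does not, which is the kind of output-preserving transformation the abstract describes.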
Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies
Yuxuan Hu | Jianchao Tan | Jiaqi Zhang | Wen Zan | Pingwei Sun | Yifan Lu | Xunliang Cai | Jing Zhang
Findings of the Association for Computational Linguistics: ACL 2026
In this work, we conduct a systematic analysis of Native Sparse Attention (NSA) and propose targeted improvements that enhance long-context modeling. A key insight is that alternating between local (sliding-window) and global (compression/selective) attention across layers, rather than using fixed patterns, enables more effective propagation of long-range dependencies and substantially boosts performance on long-sequence tasks. We further refine NSA’s branches with Latent Attention: the sliding-window branch is enhanced with Multi-head Latent Attention (MLA), while the compression and selective branches adopt Group-head Latent Attention (GLA). These changes reduce KV-cache memory by 50% versus NSA while improving the model’s common-sense reasoning and long-text understanding capabilities. Experiments on models from 340M to 1.3B parameters (trained on 15B and 100B tokens) show that our method matches or exceeds full attention and native sparse attention on both common-sense reasoning and long-context understanding tasks.
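The alternating local/global layout can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice to start with a local layer, and the window size are all hypothetical, and a plain causal mask stands in for the global (compression/selective) branch, whose real sparsity pattern is not reproduced here.

```python
def layer_schedule(num_layers, first="local"):
    """Assign each layer a branch type, alternating local and global.

    Which type comes first is an assumption made for illustration.
    """
    kinds = ("local", "global") if first == "local" else ("global", "local")
    return [kinds[i % 2] for i in range(num_layers)]

def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    A key is visible if it is not in the future (j <= i) and lies
    within the last `window` positions, counting the query itself.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

def causal_mask(seq_len):
    """Full causal mask, used here only as a stand-in for a global layer."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masks_for_model(num_layers, seq_len, window):
    """Build one attention mask per layer following the alternating schedule."""
    masks = []
    for kind in layer_schedule(num_layers):
        if kind == "local":
            masks.append(sliding_window_mask(seq_len, window))
        else:
            masks.append(causal_mask(seq_len))
    return masks

schedule = layer_schedule(4)
masks = masks_for_model(num_layers=4, seq_len=4, window=2)
```

In this toy layout, a token in a local layer sees only its recent neighbors, and the following global layer lets information from earlier positions reach it, which is the intuition behind propagating long-range dependencies across alternating layers.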