Jianchen Zhu
2026
Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification
Hong Huang | Decheng Wu | Qiangqiang Hu | Guanghua Yu | Jinhai Yang | Jianchen Zhu | Xue Liu | Dapeng Wu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The deployment of Large Language Models (LLMs) on resource-constrained edge devices is increasingly hindered by prohibitive memory and computational requirements. While ternary quantization offers a compelling solution by reducing weights to {-1, 0, +1}, current implementations suffer from a fundamental misalignment with commodity hardware. Most existing methods must choose between 2-bit aligned packing, which incurs significant bit wastage, or 1.67-bit irregular packing, which degrades inference speed. To resolve this tension, we propose Sherry, a hardware-efficient ternary quantization framework. Sherry introduces 3:4 fine-grained sparsity that achieves a regularized 1.25-bit width by packing blocks of four weights into five bits, restoring power-of-two alignment. Furthermore, we identify a weight trapping issue in sparse ternary training, which leads to representational collapse. To address this, Sherry introduces Arenas, an annealing residual synapse mechanism that maintains representational diversity during training. Empirical evaluations on LLaMA-3.2 across five benchmarks demonstrate that Sherry matches state-of-the-art ternary performance while significantly reducing model size. Notably, on an Intel i7-14700HX CPU, our 1B model achieves zero accuracy loss compared to SOTA baselines while providing 25% bit savings and a 10% speedup. The code is available at https://github.com/Tencent/AngelSlim.
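The 1.25-bit figure can be checked with a small counting argument. The sketch below assumes that 3:4 sparsity means each block of four weights holds exactly one zero and three nonzero values; under that assumption there are 4 zero positions times 2^3 sign patterns, which is 32 = 2^5 blocks, so five bits per four weights suffice. The bit layout (two bits for the zero position, three for the signs) and the function names `pack_block` and `unpack_block` are hypothetical, chosen for illustration; they are not taken from the paper or from the AngelSlim code.

```python
from itertools import product


def pack_block(block):
    """Pack four ternary weights (exactly one zero) into a 5-bit integer.

    Assumed layout, for illustration only:
      bits 4-3: index of the zero weight (0..3)
      bits 2-0: sign bits of the three nonzero weights, in order (1 means -1)
    """
    block = list(block)
    if len(block) != 4 or any(w not in (-1, 0, 1) for w in block):
        raise ValueError("expected four ternary weights")
    if block.count(0) != 1:
        raise ValueError("expected exactly one zero per block")
    zero_pos = block.index(0)
    signs = 0
    for w in block:
        if w != 0:
            signs = (signs << 1) | (1 if w < 0 else 0)
    return (zero_pos << 3) | signs


def unpack_block(code):
    """Invert pack_block: recover the four ternary weights from a 5-bit code."""
    if not 0 <= code < 32:
        raise ValueError("code must fit in five bits")
    zero_pos = code >> 3
    signs = code & 0b111
    weights = []
    bit = 2  # sign bits were written most-significant first
    for i in range(4):
        if i == zero_pos:
            weights.append(0)
        else:
            weights.append(-1 if (signs >> bit) & 1 else 1)
            bit -= 1
    return weights


# Enumerate every valid block and confirm the codes fill the 5-bit range.
valid_blocks = [b for b in product((-1, 0, 1), repeat=4) if b.count(0) == 1]
codes = {pack_block(b) for b in valid_blocks}
bits_per_weight = 5 / 4
```

By the same counting, the 1.67-bit packing mentioned in the abstract corresponds to three unconstrained ternary weights (3^3 = 27 states) in five bits, which gives 5/3 bits per weight over groups of three rather than four.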
Reinforcement Learning on Pre-Training Data
Siheng Li | Kejiao Li | Zenan Xu | Guanhua Huang | Kun Li | Haoyuan Wu | Wujiajia | Zihao Zheng | Chenchen Zhang | Kun Shi | Xue Gong | Qi Yi | Ruibin Xiong | Tingqiang Xu | Yuhao Jiang | Jianfeng Yan | Yuyuan Zeng | Guanghui Xu | Jinbao Xue | Zhijiang xu | Zheng Fang | Shuai LI | Qibin Liu | Xiaoxue Li | Zhuoyu Li | Yangyu Tao | Fei Gao | Cheng Jiang | Bochao Wang | Kai Liu | Jianchen Zhu | Wai Lam | Bo Zhou | Di Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent progress in large language models (LLMs) is largely driven by scaling training compute through either pre-training with next-token prediction (NTP) or post-training with reinforcement learning (RL). The former contributes to learning broad knowledge and skills from general data, but struggles with data inefficiency and catastrophic forgetting in continual learning settings. The latter incentivizes reasoning capabilities with strong generalization, but is constrained by limited data availability due to its reliance on human annotation. To alleviate these issues, we propose Reinforcement Learning on Pre-Training data (RLPT), which combines the advantages of learning from general data and RL. In particular, RLPT derives reward signals directly from general text data through a next-segment reasoning objective, rewarding the policy for correctly predicting next text segments conditioned on the prefix text. Experiments across multiple benchmarks and models demonstrate the effectiveness of RLPT. For example, RLPT yields substantial improvements in continual pre-training (+4.6%) and provides a strong foundation for post-training (+3.4%) on Qwen3-8B-Base.
2023
EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
Hanlin Tang | Yifu Sun | Decheng Wu | Kai Liu | Jianchen Zhu | Zhanhui Kang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have proven markedly superior to conventional methods in various tasks. However, their expensive computations and high memory requirements are prohibitive for deployment. Model quantization is an effective method for reducing this overhead. The problem is that in most previous works, the quantized model was calibrated using a few samples from the training data, which might affect the generalization of the quantized LLMs to unknown cases and tasks. Hence we explore an important question: Can we design a data-independent quantization method for LLMs to guarantee its generalization performance? In this work, we propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for LLMs. Our observation indicates that two factors, outliers in the weights and the quantization ranges, are essential for reducing the quantization error. Therefore, in EasyQuant, we leave the outliers (less than 1%) unchanged and optimize the quantization range to reduce the reconstruction error. With these methods, we surprisingly find that EasyQuant achieves comparable performance to the original model. Since EasyQuant does not depend on any training data, the generalization performance of quantized LLMs is safely guaranteed. Moreover, EasyQuant can be implemented in parallel so that the quantized model can be obtained in a few minutes even for LLMs over 100B. To the best of our knowledge, this is the first work that achieves almost lossless quantization performance for LLMs under a data-independent setting, and our algorithm runs over 10 times faster than the data-dependent methods.
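The two ingredients named in the abstract, keeping outliers unchanged and optimizing the quantization range, can be illustrated with a minimal data-free sketch. Everything specific here is an assumption rather than the paper's procedure: outliers are selected with an n-sigma threshold, the range is chosen by a plain grid search over candidate scales (swapped in for whatever optimizer EasyQuant actually uses), and the names `quantize_dequantize` and `easyquant_sketch` are hypothetical.

```python
import statistics


def quantize_dequantize(values, scale, bits=4):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


def easyquant_sketch(weights, bits=4, n_sigma=3.0, grid=50):
    """Keep outliers in full precision and search for a low-error range.

    Assumptions (not from the paper): the n-sigma outlier rule and the
    grid search over scales. No calibration data is used, only the weights.
    """
    mean = statistics.fmean(weights)
    std = statistics.pstdev(weights)
    is_outlier = [abs(w - mean) > n_sigma * std for w in weights]
    normal = [w for w, o in zip(weights, is_outlier) if not o]
    max_abs = max((abs(w) for w in normal), default=0.0)
    if max_abs == 0.0:
        return list(weights), None  # nothing to quantize

    qmax = 2 ** (bits - 1) - 1
    best_scale, best_err = None, float("inf")
    for k in range(1, grid + 1):
        # Candidate clipping range is a fraction k/grid of the max magnitude.
        scale = (max_abs * k / grid) / qmax
        deq = quantize_dequantize(normal, scale, bits)
        err = sum((a - b) ** 2 for a, b in zip(normal, deq))
        if err < best_err:
            best_scale, best_err = scale, err

    deq = iter(quantize_dequantize(normal, best_scale, bits))
    restored = [w if o else next(deq) for w, o in zip(weights, is_outlier)]
    return restored, best_scale


# Toy weights in [-1, 1] plus a single large outlier.
weights = [((i * 37) % 101 - 50) / 50 for i in range(200)] + [25.0]
restored, scale = easyquant_sketch(weights)
```

In this toy example a naive scale set by the largest weight would round almost every ordinary weight to zero, whereas isolating the outlier lets the search pick a range fitted to the bulk of the distribution.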
Co-authors
- Kai Liu 2
- Decheng Wu 2
- Zheng Fang 1
- Fei Gao 1
- Xue Gong 1
- Qiangqiang Hu 1
- Guanhua Huang 1
- Hong Huang 1
- Cheng Jiang 1
- Yuhao Jiang 1
- Zhanhui Kang 1
- Shuai LI 1
- Wai Lam 1
- Kejiao Li 1
- Kun Li 1
- Siheng Li 1
- Xiaoxue Li 1
- Zhuoyu Li 1
- Qibin Liu 1
- Xue Liu 1
- Kun Shi 1
- Yifu Sun 1
- Hanlin Tang 1
- Yangyu Tao 1
- Bochao Wang 1
- Di Wang 1
- Dapeng Wu 1
- Haoyuan Wu 1
- Wujiajia 1
- Ruibin Xiong 1
- Guanghui Xu 1
- Tingqiang Xu 1
- Zenan Xu 1
- Jinbao Xue 1
- Jianfeng Yan 1
- Jinhai Yang 1
- Qi Yi 1
- Guanghua Yu 1
- Yuyuan Zeng 1
- Chenchen Zhang 1
- Zihao Zheng 1
- Bo Zhou 1
- Zhijiang xu 1