Weicong Chen
2026
Quantize What Counts: More for Keys, Less for Values
Mohsen Hariri | Alan Luo | Weicong Chen | Tianyi Zhang | Qifan Wang | Xiaotian Han | Vipin Chaudhary
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key weight matrices systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3% accuracy compared to uniform allocations (e.g., 4-bit for both), while conserving memory. These results transform bit allocation from ad hoc tuning into a theoretically grounded, geometry-driven design principle for efficient LLM inference. Source code is available at https://github.com/mohsenhariri/spectral-kv.
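The key-favored allocation described in the abstract can be illustrated with a small, hedged sketch. It is not the authors' implementation (see the linked repository for that). It assumes a plain min-max uniform quantizer, synthetic Gaussian tensors in which keys are given a 3x larger scale as a stand-in for the larger key norms the abstract reports, and hypothetical helper names (`quantize_uniform`, `kv_cache_bits`, `matrix_norms`, `compare_allocations`). It compares the total squared quantization error of 4-bit keys with 2-bit values against the reverse split at the same bit budget.

```python
import numpy as np


def quantize_uniform(x, bits):
    """Round-to-nearest uniform quantization over the tensor's min-max range.

    Illustrative only: real KV-cache quantizers usually work per channel or
    per group and store scales and zero points alongside the codes.
    """
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo


def kv_cache_bits(num_tokens, num_layers, num_kv_heads, head_dim,
                  key_bits, value_bits):
    """Bits needed for cached keys plus values, ignoring quantizer metadata."""
    per_token = num_layers * num_kv_heads * head_dim
    return num_tokens * per_token * (key_bits + value_bits)


def matrix_norms(w):
    """Return (Frobenius norm, spectral norm) of a 2-D matrix."""
    return np.linalg.norm(w, "fro"), np.linalg.norm(w, 2)


def compare_allocations(seed=0, key_scale=3.0, shape=(64, 64)):
    """Total squared error for K4V2 versus K2V4 on synthetic tensors.

    key_scale > 1 is an assumption that mimics keys having larger norms
    than values; it is not measured from any real model.
    """
    rng = np.random.default_rng(seed)
    keys = key_scale * rng.standard_normal(shape)
    values = rng.standard_normal(shape)

    def sq_err(x, bits):
        return float(np.sum((x - quantize_uniform(x, bits)) ** 2))

    k4v2 = sq_err(keys, 4) + sq_err(values, 2)
    k2v4 = sq_err(keys, 2) + sq_err(values, 4)
    return k4v2, k2v4


if __name__ == "__main__":
    k4v2, k2v4 = compare_allocations()
    print("squared error, 4-bit keys / 2-bit values:", k4v2)
    print("squared error, 2-bit keys / 4-bit values:", k2v4)
    # Memory of a 4+2 split relative to uniform 4+4 (metadata ignored).
    mixed = kv_cache_bits(4096, 32, 8, 128, key_bits=4, value_bits=2)
    uniform = kv_cache_bits(4096, 32, 8, 128, key_bits=4, value_bits=4)
    print("memory ratio K4V2 / K4V4:", mixed / uniform)
```

Both splits use 6 bits per key-value element pair, so the comparison isolates where the precision goes. Under the stated scale assumption, the tensor with the larger range dominates the error, which is the intuition behind giving keys more bits. The model dimensions in the memory example are arbitrary placeholders.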
2019
Microsoft Research Asia’s Systems for WMT19
Yingce Xia | Xu Tan | Fei Tian | Fei Gao | Di He | Weicong Chen | Yang Fan | Linyuan Gong | Yichong Leng | Renqian Luo | Yiren Wang | Lijun Wu | Jinhua Zhu | Tao Qin | Tie-Yan Liu
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
We, Microsoft Research Asia, made submissions to 11 language directions in the WMT19 news translation tasks. We won first place in 8 of the 11 directions and second place in the other three. Our basic systems are built on Transformer, back translation, and knowledge distillation. We integrate several of our recent techniques to enhance the baseline systems: multi-agent dual learning (MADL), masked sequence-to-sequence pre-training (MASS), neural architecture optimization (NAO), and soft contextual data augmentation (SCA).