Bingxuan Wang
2026
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Xin Cheng | Wangding Zeng | Damai Dai | Qinyu Chen | Bingxuan Wang | Zhenda Xie | Kezhao Huang | Xingkai Yu | Zhewen Hao | Han Zhang | Yu-Kun Li | Huishuai Zhang | Dongyan Zhao | Wenfeng Liang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Mixture-of-Experts (MoE) scales capacity via conditional computation, but Transformers lack a native knowledge lookup primitive. We introduce conditional memory, instantiated via Deep Sparse Embedding (DSE), which indexes a massive embedding table using local n-grams for retrieval. We formalize the sparsity allocation problem (how to split a fixed parameter budget between MoE experts and DSE memory) and find a U-shaped scaling law that identifies an optimal balance. Scaled to 27B parameters, DSE outperforms an iso-parameter, iso-FLOPs MoE baseline across knowledge and reasoning benchmarks and achieves markedly stronger long-context performance. Mechanistic analyses show that DSE offloads early-layer static recall into memory, freeing effective depth and attention for higher-level reasoning. DSE is also infrastructure-efficient: its deterministic hashing enables offloading massive parameters into host memory during inference with negligible throughput overhead.
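
The lookup idea in the abstract (a deterministic hash of a local n-gram selects a row of a large embedding table) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `ngram_slot`, `build_table`, and `lookup`, the polynomial hash, the table size, and the random initialization are all assumptions chosen for illustration only.

```python
import random

# Hypothetical constants for this sketch; the paper's hash function is not given in the abstract.
_MULT = 1000003
_MOD = (1 << 61) - 1  # a large Mersenne prime


def ngram_slot(ngram, table_size):
    """Deterministically map a tuple of token ids to a table row index."""
    h = 0
    for tok in ngram:
        h = (h * _MULT + tok + 1) % _MOD
    return h % table_size


def build_table(table_size, dim, seed=0):
    """Create a toy embedding table as a list of rows (randomly initialized)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(table_size)]


def lookup(token_ids, table, n=2):
    """Return one memory vector per position, keyed by the n-gram ending there.

    Positions near the start use a shorter n-gram because fewer
    preceding tokens are available.
    """
    size = len(table)
    out = []
    for i in range(len(token_ids)):
        start = max(0, i - n + 1)
        ngram = tuple(token_ids[start:i + 1])
        out.append(table[ngram_slot(ngram, size)])
    return out


table = build_table(table_size=1024, dim=8)
vectors = lookup([5, 9, 5, 9], table, n=2)
```

Because the row index depends only on the token ids, the rows needed for a sequence can be computed before any model layer runs, which is the property the abstract links to offloading the table into host memory. In this sketch, the two occurrences of the bigram (5, 9) retrieve the same row.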