Siyu Zhu
2026
T⋆: Progressive Block Scaling for Masked Diffusion Language Models Through Trajectory Aware Reinforcement Learning
Hanchen Xia | Baoyou Chen | Yutang Ge | Guojiang Zhao | Siyu Zhu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
We present T⋆, a simple TraceRL-based curriculum for progressive block-size scaling in masked diffusion language models (MDMs). Starting from an AR-initialized small-block MDM, T⋆ gradually increases the block size while re-optimizing the denoising policy at each stage, enabling higher-parallelism decoding with limited degradation on math reasoning benchmarks. Across two SDAR scales and three benchmarks, T⋆ consistently outperforms direct large-block TraceRL and is substantially more stable during training. Our schedule analysis suggests that the learned policy does not simply revert to a strictly left-to-right order; instead, it retains block-size-specific non-monotone updates while improving accuracy.
2025
Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems
Kayhan Behdin | Ata Fatahibaarzi | Qingquan Song | Yun Dai | Aman Gupta | Zhipeng Wang | Hejian Sang | Shao Tang | Gregory Dexter | Sirou Zhu | Siyu Zhu | Tejas Dharamsi | Vignesh Kothapalli | Zhoutong Fu | Yihan Cao | Pin-Lun Hsu | Fedor Borisyuk | Natesh S. Pillai | Luke Simon | Rahul Mazumder
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendation systems to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present a comprehensive set of insights for training and deploying small language models (SLMs) that deliver high performance for a variety of industry use cases. We focus on two key techniques: (1) knowledge distillation and (2) model compression via structured pruning and quantization. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training and serving costs and latency. We detail the impact of these techniques on a variety of use cases in a large professional social network platform and share deployment lessons, including hardware optimization strategies that improve speed and throughput for both predictive and reasoning-based applications in recommendation systems.