2025
Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems
Kayhan Behdin | Ata Fatahibaarzi | Qingquan Song | Yun Dai | Aman Gupta | Zhipeng Wang | Hejian Sang | Shao Tang | Gregory Dexter | Sirou Zhu | Siyu Zhu | Tejas Dharamsi | Vignesh Kothapalli | Zhoutong Fu | Yihan Cao | Pin-Lun Hsu | Fedor Borisyuk | Natesh S. Pillai | Luke Simon | Rahul Mazumder
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendation systems to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present a comprehensive set of insights for training and deploying small language models (SLMs) that deliver high performance for a variety of industry use cases. We focus on two key techniques: (1) knowledge distillation and (2) model compression via structured pruning and quantization. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training/serving costs and latency. We detail the impact of these techniques on a variety of use cases in a large professional social network platform and share deployment lessons, including hardware optimization strategies that improve speed and throughput for both predictive and reasoning-based applications in Recommendation Systems.
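The knowledge-distillation technique named in the abstract can be illustrated with a short sketch. The following is a minimal example of the usual soft-target/hard-target distillation loss, written in PyTorch; the temperature, loss weighting, and vocabulary size are illustrative assumptions, not values from the paper.

# Minimal knowledge-distillation loss sketch (PyTorch). The temperature and
# alpha values are illustrative assumptions, not figures from the paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example usage with random tensors standing in for a batch.
student = torch.randn(4, 32000)   # student (SLM) logits over the vocabulary
teacher = torch.randn(4, 32000)   # teacher (large LLM) logits
labels = torch.randint(0, 32000, (4,))
print(distillation_loss(student, teacher, labels))

The student is trained to match the teacher's softened output distribution while still fitting the ground-truth labels, which is how the SLM retains much of the larger model's quality at a fraction of the serving cost.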
2023
BIT’s System for Multilingual Track
Zhipeng Wang | Yuhang Guo | Shuoying Chen
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
This paper describes the system we submitted to the IWSLT 2023 multilingual speech translation track, with English speech as input and text in 10 target languages as output. Our system combines a CNN with a Transformer: the convolutional layers downsample the speech features and extract local information, while the Transformer extracts global features and produces the final output. We use a speech recognition task to pre-train the encoder parameters, and then train the multilingual speech translation model on the speech translation corpus. We also adopt other methods to optimize the model, such as data augmentation and model ensembling. Our system obtains satisfactory results on the test sets of the 10 languages in the MuST-C corpus.
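The CNN-plus-Transformer layout the abstract describes can be sketched briefly. The example below is a minimal illustration of strided convolutional downsampling followed by a Transformer encoder, written in PyTorch; the layer sizes, strides, and feature dimensions are illustrative assumptions rather than the submitted system's configuration.

# Minimal sketch of CNN downsampling followed by a Transformer encoder.
# All dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class ConvSubsampler(nn.Module):
    """Two strided 1-D convolutions that shrink the time axis by roughly 4x."""
    def __init__(self, in_dim=80, out_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, out_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(out_dim, out_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):                  # x: (batch, time, feat)
        x = self.conv(x.transpose(1, 2))   # convolve over the time dimension
        return x.transpose(1, 2)           # back to (batch, time', feat)

subsample = ConvSubsampler()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
)

feats = torch.randn(2, 400, 80)      # 2 utterances of 80-dim filterbank frames
hidden = encoder(subsample(feats))   # local CNN features -> global Transformer context
print(hidden.shape)                  # torch.Size([2, 100, 512])

The convolutions capture local acoustic patterns and shorten the sequence, which keeps the Transformer's self-attention over the remaining frames tractable.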
2021
BIT’s system for AutoSimulTrans2021
Mengge Liu | Shuoying Chen | Minqin Li | Zhipeng Wang | Yuhang Guo
Proceedings of the Second Workshop on Automatic Simultaneous Translation
In this paper, we introduce our Chinese-English simultaneous translation system participating in AutoSimulTrans2021. In simultaneous translation, translation quality and delay are both important. To reduce translation delay, we cut the streaming source sentence into segments and translate each segment before the full sentence is received. To obtain high-quality translations, we pre-train a translation model on an adequate corpus and fine-tune the model with domain adaptation and sentence-length adaptation. Experimental results on the evaluation data show that our system performs better than the baseline system.
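The segment-and-translate strategy described above can be sketched in a few lines. The example below is a minimal illustration of cutting a streaming source into fixed-length segments and translating each segment before the full sentence arrives; the segment length and the translate() stub are hypothetical placeholders standing in for the fine-tuned translation model the abstract describes.

# Minimal sketch of segment-based simultaneous translation.
# translate() and segment_len are illustrative placeholders.
def translate(segment_tokens):
    # Placeholder for the fine-tuned NMT model described in the abstract.
    return ["<zh->en:" + tok + ">" for tok in segment_tokens]

def simultaneous_translate(stream, segment_len=4):
    buffer, outputs = [], []
    for token in stream:                        # source tokens arrive one by one
        buffer.append(token)
        if len(buffer) >= segment_len:
            outputs.extend(translate(buffer))   # emit before the sentence ends
            buffer = []
    if buffer:                                  # flush the final partial segment
        outputs.extend(translate(buffer))
    return outputs

print(simultaneous_translate("今天 天气 很 好 我们 去 公园".split()))

Translating each segment as soon as it is complete trades some contextual quality for lower delay, which is the balance the system tunes via domain and sentence-length adaptation.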