Yihan Cao
2025
Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems
Kayhan Behdin | Ata Fatahibaarzi | Qingquan Song | Yun Dai | Aman Gupta | Zhipeng Wang | Hejian Sang | Shao Tang | Gregory Dexter | Sirou Zhu | Siyu Zhu | Tejas Dharamsi | Vignesh Kothapalli | Zhoutong Fu | Yihan Cao | Pin-Lun Hsu | Fedor Borisyuk | Natesh S. Pillai | Luke Simon | Rahul Mazumder
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendation systems to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present a comprehensive set of insights for training and deploying small language models (SLMs) that deliver high performance for a variety of industry use cases. We focus on two key techniques: (1) knowledge distillation and (2) model compression via structured pruning and quantization. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training/serving costs and latency. We detail the impact of these techniques on a variety of use cases in a large professional social network platform and share deployment lessons, including hardware optimization strategies that improve speed and throughput for both predictive and reasoning-based applications in Recommendation Systems.
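As a concrete illustration of the first technique, standard logit-based knowledge distillation blends a hard-label cross-entropy loss with a temperature-softened KL term that pulls the student's output distribution toward the teacher's. The PyTorch sketch below shows only this general recipe; the temperature, mixing weight, and function names are illustrative assumptions, not the paper's actual training setup.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Minimal sketch of logit-based knowledge distillation; the
    # hyperparameters here are assumptions, not the paper's settings.
    # Soften both distributions, then match student to teacher.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    kd = kd * temperature ** 2  # rescale so gradients match the CE term
    ce = F.cross_entropy(student_logits, labels)  # hard-label supervision
    return alpha * kd + (1.0 - alpha) * ce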
2023
API-Assisted Code Generation for Question Answering on Varied Table Structures
Yihan Cao | Shuyi Chen | Ryan Liu | Zhiruo Wang | Daniel Fried
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
A persistent challenge in table question answering (TableQA) via executable program generation has been adapting to varied table structures, which typically require domain-specific logical forms. In response, this paper introduces a unified TableQA framework that: (1) provides a unified representation for structured tables as multi-index Pandas data frames, (2) uses Python as a powerful querying language, and (3) uses few-shot prompting to translate NL questions into Python programs that are executable on Pandas data frames. Furthermore, to answer complex relational questions with extended program functionality and external knowledge, our framework allows customized APIs that Python programs can call. We experiment with four TableQA datasets covering tables of different structures (relational, multi-table, and hierarchical matrix shapes) and achieve substantial improvements over past state-of-the-art systems. In ablation studies, we (1) show the benefits of our multi-index representation and APIs over baselines that use only an LLM, and (2) demonstrate that our approach is modular and can incorporate additional APIs.
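To make the representation concrete, a hierarchical matrix table can be encoded as a multi-index Pandas data frame and answered with an ordinary Python program, as in the minimal sketch below; the table, question, and query are invented for illustration and are not drawn from the paper's datasets or prompts.

import pandas as pd

# A hierarchical matrix table: (region, product) rows, (year, metric) columns.
index = pd.MultiIndex.from_product(
    [["North", "South"], ["Widgets", "Gadgets"]],
    names=["region", "product"],
)
columns = pd.MultiIndex.from_product(
    [["2022", "2023"], ["units", "revenue"]],
    names=["year", "metric"],
)
df = pd.DataFrame(
    [[10, 100, 12, 130],
     [5, 80, 7, 110],
     [8, 90, 9, 105],
     [6, 70, 6, 75]],
    index=index, columns=columns,
)

# A generated "program" for the NL question:
# "Which region sold more Widgets in 2023?"
widgets_2023_units = df.xs("Widgets", level="product")[("2023", "units")]
print(widgets_2023_units.idxmax())  # -> North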