Vignesh Kothapalli
2026
Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation
Wei-Rui Chen | Vignesh Kothapalli | Ata Fatahibaarzi | Hejian Sang | Shao Tang | Qingquan Song | Zhipeng Wang | Muhammad Abdul-Mageed
Findings of the Association for Computational Linguistics: ACL 2026
Distilling the capabilities from a large reasoning model (LRM) to a smaller student model often involves training on substantial amounts of reasoning data. However, knowledge distillation (KD) over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) sections makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different sections (P, CoT, A) affects student performance. Our analysis shows that selective KD over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that beyond a specific length, longer training sequences provide marginal returns for downstream performance but require substantially higher memory and FLOPs. In particular, training on only the first 50% of tokens of every training sequence can retain, on average, ≈91% of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about 50% each. Code is available at https://github.com/weiruichen01/distilling-the-essence.
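The two ideas in this abstract, keeping only a leading fraction of each training sequence and supervising only the CoT section, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `truncate_sequence` and `cot_only_mask`, the floor-based rounding, and the 0/1 mask layout are all assumptions made for illustration.

```python
from typing import List


def truncate_sequence(token_ids: List[int], keep_fraction: float = 0.5) -> List[int]:
    """Keep only the leading fraction of a tokenized training sequence.

    Hypothetical helper: the rounding rule (floor, with at least one token
    kept for non-empty input) is an assumption, not taken from the paper.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    keep = max(1, int(len(token_ids) * keep_fraction))
    return token_ids[:keep]


def cot_only_mask(prompt_len: int, cot_len: int, answer_len: int) -> List[int]:
    """Build a per-token loss mask that supervises only the CoT section.

    1 marks a token that contributes to the distillation loss, 0 marks a
    token that is ignored. The P | CoT | A ordering follows the abstract.
    """
    return [0] * prompt_len + [1] * cot_len + [0] * answer_len


if __name__ == "__main__":
    tokens = list(range(10))
    print(truncate_sequence(tokens))   # first half of the sequence
    print(cot_only_mask(2, 3, 1))      # mask over a 6-token sequence
```

In a real training loop, the truncated sequence would be fed to the student and the mask would be multiplied into the per-token loss before averaging; both steps depend on the training framework and are left out here.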
To Think or Not to Think: The Hidden Cost of Meta-Training with Excessive CoT Examples
Vignesh Kothapalli | Ata Fatahibaarzi | Hamed Firooz | Maziar Sanjabi
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chain-of-thought (CoT) prompting combined with few-shot in-context learning (ICL) has unlocked significant reasoning capabilities in large language models (LLMs). However, ICL with CoT examples is ineffective on novel tasks when the pre-training knowledge is insufficient. We study this problem in a controlled setting using the CoT-ICL Lab framework, and propose meta-training techniques to learn novel abstract reasoning tasks in-context. Although CoT examples facilitate reasoning, we noticed that their excessive inclusion during meta-training degrades performance when CoT supervision is limited. To mitigate such behavior, we propose CoT-Recipe, a formal approach to modulate the mix of CoT and non-CoT examples in meta-training sequences. We demonstrate that careful modulation via CoT-Recipe can increase the accuracy of transformers on novel tasks by up to 300% even when there are no CoT examples available in-context. We confirm the broader effectiveness of these techniques by applying them to pretrained LLMs (Qwen2.5 series) for symbolic reasoning tasks and observing gains of up to 130% in accuracy.
2025
Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems
Kayhan Behdin | Ata Fatahibaarzi | Qingquan Song | Yun Dai | Aman Gupta | Zhipeng Wang | Hejian Sang | Shao Tang | Gregory Dexter | Sirou Zhu | Siyu Zhu | Tejas Dharamsi | Vignesh Kothapalli | Zhoutong Fu | Yihan Cao | Pin-Lun Hsu | Fedor Borisyuk | Natesh S. Pillai | Luke Simon | Rahul Mazumder
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendation systems to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present a comprehensive set of insights for training and deploying small language models (SLMs) that deliver high performance for a variety of industry use cases. We focus on two key techniques: (1) knowledge distillation and (2) model compression via structured pruning and quantization. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training/serving costs and latency. We detail the impact of these techniques on a variety of use cases in a large professional social network platform and share deployment lessons, including hardware optimization strategies that improve speed and throughput for both predictive and reasoning-based applications in Recommendation Systems.
CoT-ICL Lab: A Synthetic Framework for Studying Chain-of-Thought Learning from In-Context Demonstrations
Vignesh Kothapalli | Hamed Firooz | Maziar Sanjabi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce CoT-ICL Lab, a framework and methodology to generate synthetic tokenized datasets and systematically study chain-of-thought (CoT) in-context learning (ICL) in language models. CoT-ICL Lab allows fine-grained control over the complexity of in-context examples by decoupling (1) the causal structure involved in chain token generation from (2) the underlying token processing functions. We train decoder-only transformers (up to 700M parameters) on these datasets and show that CoT accelerates the accuracy transition to higher values across model sizes. In particular, we find that model depth is crucial for leveraging CoT with limited in-context examples, while more examples help shallow models match deeper model performance. Additionally, limiting the diversity of token processing functions throughout training improves causal structure learning via ICL. We also interpret these transitions by analyzing transformer embeddings and attention maps. Overall, CoT-ICL Lab serves as a simple yet powerful testbed for theoretical and empirical insights into ICL and CoT in language models.