Laurens Van Der Maas
2025
Speed Without Sacrifice: Fine-Tuning Language Models with Medusa and Knowledge Distillation in Travel Applications
Daniel Zagyva | Emmanouil Stergiadis | Laurens Van Der Maas | Aleksandra Dokic | Eran Fainman | Ilya Gusev | Moran Beladev
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
In high-stakes industrial NLP applications, balancing generation quality with speed and efficiency presents significant challenges. We address them by investigating two complementary optimization approaches: Medusa for speculative decoding and knowledge distillation (KD) for model compression. We demonstrate the practical application of these techniques in real-world travel domain tasks, including trip planning, smart filters, and generating accommodation descriptions. We introduce modifications to the Medusa implementation, starting with base pre-trained models rather than conversational fine-tuned ones, and developing a simplified single-stage training process for Medusa-2 that maintains performance while reducing computational requirements. Lastly, we present a novel framework that combines Medusa with knowledge distillation, achieving compounded benefits in both model size and inference speed. Our experiments with TinyLlama-1.1B as the student model and Llama-3.1-70B as the teacher show that the combined approach maintains the teacher’s performance quality while reducing inference latency by 10-20x.
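To illustrate how speculative decoding heads and knowledge distillation can be trained jointly, below is a minimal sketch of a combined training loss. It is not the paper's implementation: the residual-MLP Medusa heads, the loss weights, the temperature, and all function names are illustrative assumptions, and random tensors stand in for real student/teacher model outputs.

```python
# Hypothetical sketch of a combined Medusa + knowledge-distillation training loss.
# Assumptions: the student exposes hidden states and next-token logits, each Medusa
# head is a small residual MLP predicting the token k extra steps ahead, and the
# teacher provides soft next-token distributions. All names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MedusaHead(nn.Module):
    """One extra decoding head predicting the token k positions further ahead."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.lm = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden):  # hidden: (batch, seq, hidden)
        # Residual MLP on top of the student's hidden states, then project to vocab.
        return self.lm(hidden + torch.relu(self.proj(hidden)))

def combined_loss(student_logits, medusa_logits, teacher_logits, labels,
                  kd_weight=0.5, medusa_weight=0.2, temperature=2.0):
    """Hard-label CE + soft-label KD to the teacher + CE for each Medusa head."""
    vocab = student_logits.size(-1)

    # Standard next-token cross-entropy for the base language-model head.
    ce = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1),
                         ignore_index=-100)

    # Soft-label distillation: KL between student and teacher distributions.
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2

    # Medusa head k is supervised with labels shifted k extra positions ahead.
    medusa = 0.0
    for k, logits in enumerate(medusa_logits, start=1):
        shifted = labels[:, k:]
        medusa = medusa + F.cross_entropy(
            logits[:, :-k].reshape(-1, vocab), shifted.reshape(-1),
            ignore_index=-100)

    return ce + kd_weight * kd + medusa_weight * medusa

# Toy usage with random tensors in place of real model outputs.
batch, seq, hidden, vocab, n_heads = 2, 16, 64, 100, 3
hidden_states = torch.randn(batch, seq, hidden)
heads = [MedusaHead(hidden, vocab) for _ in range(n_heads)]
student_logits = torch.randn(batch, seq, vocab)
teacher_logits = torch.randn(batch, seq, vocab)
labels = torch.randint(0, vocab, (batch, seq))
loss = combined_loss(student_logits, [h(hidden_states) for h in heads],
                     teacher_logits, labels)
loss.backward()
```

In this sketch only the Medusa heads receive gradients; in practice the student backbone would also be trained (or frozen, in a Medusa-1-style setup), and the single-stage Medusa-2 training described in the abstract would optimize backbone and heads together.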