BASS: Batched Attention-optimized Speculative Sampling
Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras
Abstract
Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and performing speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8 ms per token, with an overall throughput of 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15× speed-up over optimized regular decoding. Within a time budget that regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what is feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3× the peak of regular decoding and around 10× that of single-sequence speculative decoding.
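As a rough illustration of the decoding loop the abstract refers to, the sketch below applies the standard draft-then-verify speculative-sampling accept/reject rule independently to each sequence in a batch. The toy `draft_logits`/`target_logits` stand-ins, the vocabulary size, and the fixed draft length of 4 are illustrative assumptions, not the paper's implementation; BASS's attention-kernel and ragged-length optimizations are not shown here.

```python
# Minimal sketch of batched draft-then-verify speculative sampling,
# using toy stand-in "models" over a small vocabulary. Names and the
# fixed draft length are illustrative assumptions, not BASS's system.
import numpy as np

VOCAB, DRAFT_LEN = 50, 4
rng = np.random.default_rng(0)

def draft_logits(prefix):   # stand-in for a small draft model
    return rng.standard_normal(VOCAB)

def target_logits(prefix):  # stand-in for the large target model
    return rng.standard_normal(VOCAB)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def spec_step(prefix):
    """One speculative step for a single sequence: draft DRAFT_LEN
    tokens with the draft model, then accept/reject them against the
    target model's distribution (standard speculative sampling)."""
    drafted, q_probs, ctx = [], [], list(prefix)
    for _ in range(DRAFT_LEN):
        q = softmax(draft_logits(ctx))
        t = rng.choice(VOCAB, p=q)
        drafted.append(t); q_probs.append(q); ctx.append(t)
    # In a real system the target model scores all drafted positions in
    # one forward pass; here we query the toy target per position.
    accepted, ctx = [], list(prefix)
    for t, q in zip(drafted, q_probs):
        p = softmax(target_logits(ctx))
        if rng.random() < min(1.0, p[t] / q[t]):  # accept drafted token
            accepted.append(t); ctx.append(t)
        else:                                     # reject: resample from residual
            residual = np.maximum(p - q, 0)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB, p=residual))
            return accepted
    # All drafted tokens accepted: sample one bonus token from the target.
    accepted.append(rng.choice(VOCAB, p=softmax(target_logits(ctx))))
    return accepted

def batched_spec_step(prefixes):
    """Batched setting: each sequence may accept a different number of
    drafted tokens per step, so sequence lengths become ragged across
    the batch -- the core difficulty a batched system must handle."""
    return [prefix + spec_step(prefix) for prefix in prefixes]

batch = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(batched_spec_step(batch))
```

Because each sequence can accept a different number of drafted tokens per step, batch entries advance at different rates; handling these ragged workloads efficiently on the GPU is the kind of challenge the abstract alludes to.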
- Anthology ID: 2024.findings-acl.489
- Volume: Findings of the Association for Computational Linguistics: ACL 2024
- Month: August
- Year: 2024
- Address: Bangkok, Thailand
- Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 8214–8224
- URL: https://aclanthology.org/2024.findings-acl.489
- DOI: 10.18653/v1/2024.findings-acl.489
- Cite (ACL): Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, and Anoop Deoras. 2024. BASS: Batched Attention-optimized Speculative Sampling. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8214–8224, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal): BASS: Batched Attention-optimized Speculative Sampling (Qian et al., Findings 2024)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.findings-acl.489.pdf