CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers

Longwei Zou, Qingyang Wang, Han Zhao, Jiangangkong Jiangangkong, Yi Yang, Yangdong Deng


Abstract
Fast-growing large-scale language models are delivering unprecedented performance on almost all natural language processing tasks. However, the effectiveness of large language models relies on an exponentially increasing number of parameters. The overwhelming computational complexity incurs high inference latency that negatively affects user experience. Existing methods for improving inference efficiency, such as tensor parallelism and quantization, aim to reduce per-layer computing latency, yet overlook the cumulative latency due to the number of layers. Recent works that reduce the cumulative latency through layer removal, however, lead to a significant performance drop. Motivated by the similarity of inputs among adjacent layers, we propose to identify quasi-independent layers, which can be computed concurrently to significantly decrease inference latency. We also introduce a bypassing technique to mitigate the effect of information loss. Empirical experiments with the proposed approach on the LLaMA models confirm that Concurrent Computation of Quasi-Independent Layers (CQIL) can reduce latency by up to 48.3% on LLaMA-33B while maintaining a close level of performance.
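The idea of computing adjacent layers concurrently can be illustrated with a toy sketch. It is not the authors' implementation: it assumes each layer is a residual update x + f(x), and that layers in a group can all read the same input and have their updates summed. The names `make_layer`, `sequential`, and `concurrent_group` are hypothetical, and the bypassing technique is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def make_layer(scale):
    """Hypothetical residual branch f(x): a small elementwise update."""
    def f(x):
        return [scale * v for v in x]
    return f


def sequential(x, layers):
    # Standard residual stack: each layer sees the previous layer's output.
    for f in layers:
        x = [a + b for a, b in zip(x, f(x))]
    return x


def concurrent_group(x, layers):
    # Assumed approximation: every layer in the group reads the same input x,
    # so the branches can run in parallel and their updates are summed.
    with ThreadPoolExecutor() as pool:
        updates = list(pool.map(lambda f: f(x), layers))
    return [a + sum(u) for a, u in zip(x, zip(*updates))]


if __name__ == "__main__":
    x0 = [1.0, 2.0]
    group = [make_layer(0.01), make_layer(0.02)]
    print(sequential(x0, group))
    print(concurrent_group(x0, group))
```

When each layer changes its input only slightly, as in this toy setup, the two results stay close, which is the intuition behind treating such layers as quasi-independent.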
Anthology ID:
2024.acl-long.394
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
7293–7307
URL:
https://aclanthology.org/2024.acl-long.394
Cite (ACL):
Longwei Zou, Qingyang Wang, Han Zhao, Jiangangkong Jiangangkong, Yi Yang, and Yangdong Deng. 2024. CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7293–7307, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers (Zou et al., ACL 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.acl-long.394.pdf