HPipe: Large Language Model Pipeline Parallelism for Long Context on Heterogeneous Cost-effective Devices

Ruilong Ma, Xiang Yang, Jingyu Wang, Qi Qi, Haifeng Sun, Jing Wang, Zirui Zhuang, Jianxin Liao


Abstract
Micro-enterprises and individual developers emerge analysis demands for long sequence with powerful Large Language Models (LLMs). They try to deploy the LLMs at local, but only possess various commodity devices and the unreliable interconnection between devices. Existing parallel techniques do not lead to the same effectiveness in limited environment. The heterogeneity of devices, coupled with their limited capacity and expensive communication, brings challenges to private deployment for maximized utilization of available devices while masking latency. Hence, we introduce HPipe, a pipeline inference framework that successfully mitigates LLMs from high-performance clusters to heterogeneous commodity devices. By ensuring a balanced distribution of workloads, HPipe facilitates the parallel execution of LLMs through pipelining the sequences on the token dimension. The evaluation conducted on LLaMA-7B and GPT3-2B demonstrates that HPipe holds the potential for context analysis on LLM with heterogeneity devices, achieving an impressive speedup in latency and throughput up to 2.28 times.
Anthology ID:
2024.naacl-industry.1
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Yi Yang, Aida Davani, Avi Sil, Anoop Kumar
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
1–9
Language:
URL:
https://aclanthology.org/2024.naacl-industry.1
DOI:
Bibkey:
Cite (ACL):
Ruilong Ma, Xiang Yang, Jingyu Wang, Qi Qi, Haifeng Sun, Jing Wang, Zirui Zhuang, and Jianxin Liao. 2024. HPipe: Large Language Model Pipeline Parallelism for Long Context on Heterogeneous Cost-effective Devices. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pages 1–9, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
HPipe: Large Language Model Pipeline Parallelism for Long Context on Heterogeneous Cost-effective Devices (Ma et al., NAACL 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingestion-checklist/2024.naacl-industry.1.pdf