EdgeFormer: Latency-Aware Collaborative Multi-Head Attention of Transformer Inference in Edge Networks

Yiming Yao; Jianwei Niu; Bin Dai; Tao Ren

Abstract

Recent breakthroughs in Transformer-based large models have driven advances across a wide range of tasks, yet their reliance on centralized cloud deployment raises significant privacy risks due to sensitive data exposure. While edge-based collaborative inference offers a privacy-preserving alternative, existing methods face critical limitations: static model partitioning cannot adapt to dynamic fluctuations in edge resources, and rigid handling of multi-head attention overlooks both the prioritization of semantically critical heads and the opportunity for parallelism. We propose EdgeFormer, a latency-aware framework for distributed Transformer inference in resource-constrained edge networks. EdgeFormer dynamically allocates model blocks across devices by optimizing an efficiency-storage trade-off, and introduces collaborative Multi-Head Attention (cMHA), which distributes semantically critical attention heads across devices while pruning redundant ones under real-time constraints. We further develop LiScore, a composite metric integrating attention diversity and latency costs, alongside a similarity-based retrieval method to reduce recomputation overhead. Extensive experiments demonstrate that EdgeFormer achieves up to 2.01× inference acceleration over state-of-the-art baselines with ≤1.06% accuracy loss, while remaining robust under varying edge conditions.

Anthology ID: 2026.acl-long.2007
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 43346–43361
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2007/
Cite (ACL): Yiming Yao, Jianwei Niu, Bin Dai, and Tao Ren. 2026. EdgeFormer: Latency-Aware Collaborative Multi-Head Attention of Transformer Inference in Edge Networks. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 43346–43361, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): EdgeFormer: Latency-Aware Collaborative Multi-Head Attention of Transformer Inference in Edge Networks (Yao et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2007.pdf
Checklist: 2026.acl-long.2007.checklist.pdf
