Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong, Huan Li, Cong Wang


Abstract
Parallel Speculative Decoding (PSD) accelerates traditional Speculative Decoding (SD) by overlapping draft generation with verification. However, it remains hampered by two fundamental challenges: (1) a theoretical speedup ceiling dictated by the speed ratio between the draft and target models, and (2) high computational waste and pipeline stalls caused by mid-sequence token rejections of early errors. To address these limitations, we introduce Double (Double Retrieval Speculative Parallelism). By bridging the gap between SD and PSD, our framework resolves the Retrieval Precision-Efficiency Dilemma through a novel synchronous mechanism. Specifically, the draft model executes iterative retrieval speculations to break the theoretical speedup limit, while the target model performs authoritative retrieval to generate multi-token guidance, alleviating rejections without rollback. Double is entirely training-free and lossless. Extensive experiments demonstrate state-of-the-art speedups of 5.3× on LLaMA3.3-70B and 2.8× on Qwen3-32B, significantly outperforming EAGLE-3, an advanced method that requires extensive model training. Our code is available at https://github.com/Sylvan820/Double1.
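For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of plain (serial, greedy) speculative decoding. It is not the Double method and not taken from the authors' repository. The function names, the toy models, and the `draft_len` parameter are all hypothetical, and the sketch assumes greedy decoding, so "lossless" here means the output matches ordinary greedy decoding with the target model.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def greedy_decode(target_next: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Reference: ordinary autoregressive decoding with the target model."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Serial draft-then-verify loop; returns (tokens, verification steps)."""
    out = list(prompt)
    verify_steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        # Draft phase: the small model proposes draft_len tokens.
        proposal: List[Token] = []
        for _ in range(draft_len):
            proposal.append(draft_next(out + proposal))

        # Verify phase: a real system scores every position in one batched
        # target forward pass, so it is counted here as a single step.
        verify_steps += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched: the target yields one bonus token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new_tokens], verify_steps


# Toy stand-ins (assumptions for illustration only): the draft agrees with
# the target except when the context length is a multiple of 4.
def toy_target(ctx: List[Token]) -> Token:
    return (sum(ctx) + len(ctx)) % 7


def toy_draft(ctx: List[Token]) -> Token:
    guess = toy_target(ctx)
    return (guess + 1) % 7 if len(ctx) % 4 == 0 else guess


tokens, steps = speculative_decode(toy_target, toy_draft, [1, 2], 12)
```

In this toy run the output is identical to `greedy_decode(toy_target, [1, 2], 12)` while needing fewer verification steps than generated tokens. Every token after the first mismatch in a draft is discarded, which is the waste from early errors that the abstract refers to; in this serial form the draft and verify phases also alternate rather than overlap, which is the part PSD parallelizes.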
Anthology ID:
2026.acl-long.879
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
19242–19263
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.879/
Cite (ACL):
Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong, Huan Li, and Cong Wang. 2026. Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19242–19263, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism (Shen et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.879.pdf
Checklist:
 2026.acl-long.879.checklist.pdf