FractalLLM: Lossless Self-Speculative Decoding with Layer Embedded Self-Compression

Juhyeong Kim; Sangyeon Yu; Gyunyeop Kim; Sangwoo Kang

doi:10.18653/v1/2025.findings-emnlp.1286

Abstract

Autoregressive decoding in large language models (LLMs) necessitates a full forward pass for each generated token, significantly increasing inference latency. To address this limitation, we propose FractalLLM, a lossless self-speculative decoding method that embeds a compressed model within selected decoder layers of the original model. Specifically, our approach generates multiple draft tokens in parallel by injecting compressed layers into selected decoder layers. These draft tokens are subsequently verified through a single forward pass of the original model, ensuring the final outputs exactly match those produced by the original model. Experimental results across diverse benchmarks, including GSM8K, XSUM, CNN/DailyMail, and HumanEval, demonstrate that our method achieves substantial inference speed-ups (up to 2.47×) compared to standard autoregressive decoding, without requiring any additional training.
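The draft-then-verify loop that makes speculative decoding lossless can be illustrated with a toy sketch. This is not the paper's implementation: it assumes greedy decoding, uses hypothetical stand-in functions (`toy_target`, `toy_draft`) in place of the original model and its embedded compressed layers, drafts tokens one after another rather than in parallel, and emulates the single verification pass with per-position calls.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def autoregressive_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy draft-then-verify decoding; output matches the baseline exactly."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1) Draft k candidate tokens with the cheap model (sequential in this toy).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2) Verify. A real system scores all k positions in one forward pass of
        #    the original model; this toy emulates that with per-position calls.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)  # always keep the target's own token
            if expected != tok:
                break  # first mismatch: discard the remaining draft tokens
        else:
            # Every draft token matched, so the target also yields one extra token.
            accepted.append(target(seq + accepted))

        seq.extend(accepted)
    return seq[:goal]


def toy_target(seq: List[int]) -> int:
    """Hypothetical stand-in for the original model."""
    return (seq[-1] * 3 + len(seq)) % 11


def toy_draft(seq: List[int]) -> int:
    """Hypothetical stand-in for a compressed model: usually agrees, sometimes errs."""
    guess = toy_target(seq)
    return (guess + 1) % 11 if len(seq) % 3 == 0 else guess


if __name__ == "__main__":
    prompt = [1, 2, 3]
    print(speculative_decode(toy_target, toy_draft, prompt, 20) ==
          autoregressive_decode(toy_target, prompt, 20))
```

Because every token appended to the sequence is the target's own greedy choice for that prefix, the output is identical to standard autoregressive decoding regardless of how often the draft is wrong; draft quality only affects how many tokens are accepted per verification step, and therefore the speed-up.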

Anthology ID: 2025.findings-emnlp.1286
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 23666–23673
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1286/
DOI: 10.18653/v1/2025.findings-emnlp.1286
Cite (ACL): Juhyeong Kim, Sangyeon Yu, Gyunyeop Kim, and Sangwoo Kang. 2025. FractalLLM: Lossless Self-Speculative Decoding with Layer Embedded Self-Compression. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23666–23673, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): FractalLLM: Lossless Self-Speculative Decoding with Layer Embedded Self-Compression (Kim et al., Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1286.pdf
Checklist: 2025.findings-emnlp.1286.checklist.pdf
