FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction

Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal


Abstract
Auto-regressive Large Language Models (LLMs) demonstrate remarkable performance across domains spanning vision and language tasks. However, because each token must pass sequentially through many transformer layers, autoregressive decoding is computationally expensive, particularly in resource-constrained environments such as mobile and edge devices. Existing approaches that reduce latency by skipping layers come in two distinct flavors: (1) early exit, and (2) input-agnostic heuristics in which tokens exit at pre-determined layers regardless of the input sequence. Both strategies have limitations: the former cannot be applied in the presence of KV caching, which is essential for speed-ups in modern inference frameworks, and the latter fails to capture variation in layer importance across tasks or, more generally, across input sequences. To address these limitations, we propose FiRST, a model-agnostic framework that reduces inference latency by using layer-specific routers to adaptively skip transformer layers during decoding, based on routing decisions made from the input prompt during the prefill stage. FiRST remains fully compatible with KV caching, enabling faster decoding while maintaining quality. Our method shows that input adaptivity is essential: different tasks rely on different subsets of layers to evolve meaningful representations. Extensive experiments show that FiRST significantly reduces latency while outperforming existing layer-selection strategies in quality, retaining performance comparable to the base model without skipping. FiRST is thus a promising and efficient solution for LLM deployment in low-resource environments.
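To make the routing idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of input-adaptive layer skipping as described in the abstract: tiny per-layer routers score the prompt representation during prefill, layers whose score falls below a threshold are skipped for the entire decode of that sequence, and because a skipped layer is skipped for every token, its KV-cache entries are never created or read, keeping caching consistent. All names here (LayerRouter, plan_layers, skip_threshold, the layer call signature) are illustrative assumptions.

```python
# Hypothetical sketch of prefill-time routing for layer skipping.
import torch
import torch.nn as nn


class LayerRouter(nn.Module):
    """Tiny per-layer gate that scores the pooled prompt representation."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, 1)

    def forward(self, prompt_hidden: torch.Tensor) -> torch.Tensor:
        # prompt_hidden: (batch, seq_len, hidden) from the prefill pass
        pooled = prompt_hidden.mean(dim=1)           # (batch, hidden)
        return torch.sigmoid(self.gate(pooled))      # (batch, 1) keep-probability


def plan_layers(prompt_hidden, routers, skip_threshold=0.5):
    """Decide once, at prefill time, which decoder layers to keep."""
    keep = []
    for router in routers:
        score = router(prompt_hidden).mean().item()  # one decision per sequence
        keep.append(score >= skip_threshold)
    return keep  # e.g. [True, True, False, True, ...]


def decode_step(hidden, layers, keep_mask, kv_cache):
    """One decoding step that honors the prefill-time skip plan.

    A skipped layer acts as an identity mapping for every decoded token,
    so its slot in the KV cache is simply never touched.
    """
    for i, layer in enumerate(layers):
        if not keep_mask[i]:
            continue
        hidden = layer(hidden, kv_cache=kv_cache[i])  # assumed layer interface
    return hidden
```

The key design choice illustrated here is that the skip decision is made once per sequence from the prompt, rather than per token, which is what keeps the scheme compatible with KV caching.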
Anthology ID:
2025.findings-emnlp.1197
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
21957–21975
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1197/
DOI:
10.18653/v1/2025.findings-emnlp.1197
Cite (ACL):
Akriti Jain, Saransh Sharma, Koyel Mukherjee, and Soumyabrata Pal. 2025. FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 21957–21975, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction (Jain et al., Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1197.pdf
Checklist:
2025.findings-emnlp.1197.checklist.pdf