FREE: Fast and Robust Vision Language Models with Early Exits

Divya Jyoti Bajpai, Manjesh Kumar Hanawal


Abstract
In recent years, Vision-Language Models (VLMs) have shown remarkable performance improvements in Vision-Language tasks. However, their large size poses challenges for real-world applications where inference latency is a concern. To tackle this issue, we propose employing Early Exit (EE) strategies in VLMs. Training exit classifiers in VLMs is itself challenging, particularly with limited labeled training data. To address this, we introduce FREE, an adversarial training approach within a GAN-based framework. Here, each exit consists of a transformer layer and a classifier. The transformer layer is adversarially trained to produce feature representations similar to those of the final layer, while a feature classifier serves as the discriminator. Our method performs input-adaptive inference, which increases inference speed with minimal drop in performance. Experimental results demonstrate that our approach improves accuracy and model robustness by mitigating overthinking and the mid-crisis phenomenon that we highlight. We experimentally validate that our method speeds up inference by more than 1.51× while retaining comparable performance. The anonymized source code is available at https://github.com/Div290/BLIPEE.
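To make the idea of input-adaptive inference concrete, the following is a minimal sketch of a generic early-exit loop in plain Python. It is not the authors' implementation: the abstract does not state the exit criterion, so the softmax-confidence threshold used here is an assumption, and `early_exit_predict`, `softmax`, the toy `layers`, and the toy `exit_heads` are hypothetical names introduced only for illustration. The adversarial (GAN-based) training of the exits is not shown.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    hidden: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    exit_heads: Sequence[Callable[[Vector], Vector]],
    threshold: float = 0.9,  # assumed confidence threshold, not from the paper
) -> Tuple[int, int]:
    """Return (predicted_class, index_of_layer_where_inference_stopped)."""
    if not layers or len(layers) != len(exit_heads):
        raise ValueError("need one exit head per layer and at least one layer")
    probs: List[float] = []
    for i, (layer, head) in enumerate(zip(layers, exit_heads)):
        hidden = layer(hidden)
        probs = softmax(head(hidden))
        # Stop as soon as the exit classifier is confident enough.
        if max(probs) >= threshold:
            return probs.index(max(probs)), i
    # No exit was confident: fall back to the last layer's prediction.
    return probs.index(max(probs)), len(layers) - 1


# Toy stand-ins: each "layer" doubles the features, each "head" reads them as logits.
layers = [lambda h: [2.0 * x for x in h]] * 4
exit_heads = [lambda h: list(h)] * 4

easy = early_exit_predict([3.0, 0.0], layers, exit_heads)
hard = early_exit_predict([0.5, 0.0], layers, exit_heads)
print(easy, hard)
```

In this toy setup the first input clears the threshold at the first exit, while the second needs three layers before its exit classifier is confident, which is the sense in which compute adapts to the input.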
Anthology ID:
2025.findings-acl.1209
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
23599–23615
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1209/
Cite (ACL):
Divya Jyoti Bajpai and Manjesh Kumar Hanawal. 2025. FREE: Fast and Robust Vision Language Models with Early Exits. In Findings of the Association for Computational Linguistics: ACL 2025, pages 23599–23615, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
FREE: Fast and Robust Vision Language Models with Early Exits (Bajpai & Hanawal, Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1209.pdf