SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models

Seanie Lee; Dong Bok Lee; Dominik Wagner; Minki Kang; Haebin Seong; Tobias Bocklet; Juho Lee; Sung Ju Hwang

Abstract

Deploying large language models (LLMs) in real-world applications requires robust safety guard models to detect and block harmful user prompts. While large safety guard models achieve strong performance, their computational cost is substantial. To mitigate this, smaller distilled models are used, but they often underperform on “hard” examples where the larger model provides accurate predictions. We observe that many inputs can be reliably handled by the smaller model, while only a small fraction require the larger model’s capacity. Motivated by this, we propose SafeRoute, a binary router that distinguishes hard examples from easy ones. Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy compared to solely using the larger safety guard model. Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance, outperforming relevant baselines.
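The routing idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `RoutedGuard`, `toy_small_guard`, `toy_large_guard`, and `toy_router`, the 0.5 threshold, and the keyword and length heuristics are hypothetical stand-ins, not the authors' models or code. The sketch only shows the control flow, where a binary router decides whether a prompt goes to the smaller or the larger safety guard model.

```python
from dataclasses import dataclass
from typing import Callable

# All names below are hypothetical stand-ins, not the authors' API.
Guard = Callable[[str], bool]    # True means the prompt is judged harmful
Router = Callable[[str], float]  # estimated probability that a prompt is "hard"


@dataclass
class RoutedGuard:
    small_guard: Guard
    large_guard: Guard
    router: Router
    threshold: float = 0.5  # assumed cut-off; not taken from the paper
    large_calls: int = 0    # counts how often the expensive model was used

    def is_harmful(self, prompt: str) -> bool:
        # Send the prompt to the large model only when the router flags it as hard.
        if self.router(prompt) >= self.threshold:
            self.large_calls += 1
            return self.large_guard(prompt)
        return self.small_guard(prompt)


def toy_small_guard(prompt: str) -> bool:
    # Toy "small" model: catches only one obvious keyword.
    return "bomb" in prompt.lower()


def toy_large_guard(prompt: str) -> bool:
    # Toy "large" model: catches a slightly wider set of keywords.
    text = prompt.lower()
    return "bomb" in text or "explosive" in text


def toy_router(prompt: str) -> float:
    # Toy router: pretends that prompts longer than eight words are the hard ones.
    return 0.9 if len(prompt.split()) > 8 else 0.1
```

In this toy setup, short prompts are answered by the small guard alone, and `large_calls` tracks how many prompts incurred the cost of the large guard, which is the quantity the cost versus accuracy trade-off is about.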

Anthology ID: 2025.findings-acl.105
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues: Findings | WS
Publisher: Association for Computational Linguistics
Pages: 2053–2069
URL: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.105/
Cite (ACL): Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, and Sung Ju Hwang. 2025. SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2053–2069, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models (Lee et al., Findings 2025)
PDF: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.105.pdf
