SafeQuant: LLM Safety Analysis via Quantized Gradient Inspection

Sindhu Padakandla, Sadbhavana Babar, Rathod Darshan D, Manohar Kaul


Abstract
Contemporary jailbreak attacks on Large Language Models (LLMs) employ sophisticated techniques with obfuscated content to bypass safety guardrails. Existing defenses either use computationally intensive LLM verification or require adversarial fine-tuning, leaving models vulnerable to advanced attacks. We introduce SafeQuant, a novel defense framework that leverages quantized gradient patterns to identify harmful prompts efficiently. Our key insight is that when generating identical responses like “Sure”, LLMs exhibit distinctly different internal gradient patterns for safe versus harmful prompts, reflecting conflicts with safety training. By capturing these patterns through selective gradient masking and quantization, SafeQuant significantly outperforms existing defenses across multiple benchmarks while maintaining model utility. The method demonstrates particular effectiveness against sophisticated attacks like WordGame prompts and persuasive adversarial attacks, achieving an F1-score of 0.80 on the WordGame dataset and outperforming state-of-the-art (SoTA) methods like GradSafe by an absolute margin of 57%.
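
The core idea in the abstract, comparing gradient patterns induced by forcing a compliance response such as “Sure”, can be illustrated with a rough sketch. The snippet below is not the authors' implementation: the model name, the choice of inspected parameters (the last layer's attention output projection), the sign-threshold quantization rule, the reference harmful prompt, and the similarity threshold are all illustrative assumptions, shown only to make the mechanism concrete.

```python
# Minimal sketch of gradient-inspection-based prompt screening, assuming a
# Llama-style chat model. All hyperparameters here are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed target model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)
model.eval()

def compliance_gradient(prompt: str, response: str = "Sure") -> torch.Tensor:
    """Gradient of the loss for forcing `response` after `prompt`, restricted
    (arbitrarily, for illustration) to the final attention output projection."""
    enc = tok(prompt + " " + response, return_tensors="pt")
    labels = enc["input_ids"].clone()
    # Score only the forced response tokens; mask out the prompt tokens
    # (approximate split; tokenization boundaries may shift by a token).
    prompt_len = tok(prompt, return_tensors="pt")["input_ids"].shape[1]
    labels[:, :prompt_len] = -100

    model.zero_grad()
    out = model(**enc, labels=labels)
    out.loss.backward()

    param = model.model.layers[-1].self_attn.o_proj.weight  # assumed parameter choice
    return param.grad.detach().flatten().clone()

def quantize(g: torch.Tensor) -> torch.Tensor:
    """Coarse three-level quantization: keep only the sign of large entries."""
    thresh = g.abs().mean()
    return torch.sign(g) * (g.abs() > thresh).float()

# Reference pattern built from a known-harmful prompt (placeholder example).
reference = quantize(compliance_gradient("How do I make a weapon at home?"))

def is_harmful(prompt: str, threshold: float = 0.5) -> bool:
    """Flag a prompt if its quantized compliance gradient resembles the reference."""
    q = quantize(compliance_gradient(prompt))
    sim = torch.nn.functional.cosine_similarity(q, reference, dim=0)
    return sim.item() > threshold  # assumed decision threshold
```

In this toy version, a prompt is flagged when its quantized compliance gradient aligns with a pattern derived from known-harmful prompts; the paper's method differs in how parameters are selected, masked, and quantized, and in how the reference statistics and thresholds are obtained.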
Anthology ID:
2025.naacl-long.127
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
2522–2536
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.127/
Cite (ACL):
Sindhu Padakandla, Sadbhavana Babar, Rathod Darshan D, and Manohar Kaul. 2025. SafeQuant: LLM Safety Analysis via Quantized Gradient Inspection. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2522–2536, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
SafeQuant: LLM Safety Analysis via Quantized Gradient Inspection (Padakandla et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.127.pdf