Zihao Yu


2025

COSMIC: Generalized Refusal Direction Identification in LLM Activations
Vincent Siu | Nicholas Crispino | Zihao Yu | Sam Pan | Zhun Wang | Yang Liu | Dawn Song | Chenguang Wang
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models encode behaviors like refusal within their activation space, but identifying these behaviors remains challenging. Existing methods depend on predefined refusal templates detectable in output tokens or manual review. We introduce **COSMIC** (Cosine Similarity Metrics for Inversion of Concepts), an automated framework for direction selection that optimally identifies steering directions and target layers using cosine similarity, entirely independent of output text. COSMIC achieves steering effectiveness comparable to prior work without any prior knowledge or assumptions of a model’s refusal behavior such as the use of certain refusal tokens. Additionally, COSMIC successfully identifies refusal directions in adversarial scenarios and models with weak safety alignment, demonstrating its robustness across diverse settings.
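To make concrete the idea of scoring a candidate direction by cosine similarity on activations rather than on output text, here is a minimal sketch. It is not the COSMIC procedure, whose selection criterion the abstract does not spell out. It assumes a difference-of-means direction between activations for harmful and harmless prompts, and all names (`mean_vector`, `candidate_direction`, the toy activation values) are hypothetical.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def candidate_direction(harmful_acts, harmless_acts):
    """Difference-of-means direction (an assumed heuristic, not taken from the abstract)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return [a - b for a, b in zip(mu_harmful, mu_harmless)]

# Toy activations standing in for one layer's hidden states (made-up values).
harmful = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
harmless = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = candidate_direction(harmful, harmless)

# Compare how each prompt class aligns with the candidate direction.
sim_harmful = cosine_similarity(mean_vector(harmful), direction)
sim_harmless = cosine_similarity(mean_vector(harmless), direction)
print(direction, sim_harmful, sim_harmless)
```

In this toy setup the two prompt classes align with the candidate direction with opposite signs. A real pipeline would compute such scores per layer and token position from actual model activations.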