Zihao Yu
2025
COSMIC: Generalized Refusal Direction Identification in LLM Activations
Vincent Siu | Nicholas Crispino | Zihao Yu | Sam Pan | Zhun Wang | Yang Liu | Dawn Song | Chenguang Wang
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models encode behaviors such as refusal within their activation space, yet identifying these behaviors remains challenging. Existing methods rely on predefined refusal templates detectable in output tokens or on manual review. We introduce **COSMIC** (Cosine Similarity Metrics for Inversion of Concepts), an automated framework for direction selection that identifies optimal steering directions and target layers using cosine similarity, entirely independent of output text. COSMIC achieves steering effectiveness comparable to prior work without any prior knowledge of, or assumptions about, a model's refusal behavior, such as the use of particular refusal tokens. Additionally, COSMIC successfully identifies refusal directions in adversarial scenarios and in models with weak safety alignment, demonstrating its robustness across diverse settings.
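The abstract describes selecting a steering direction and target layer by comparing activations with cosine similarity rather than matching refusal tokens in the output. The sketch below illustrates that general idea only: difference-of-means candidate directions per layer, scored by an ablation-based cosine similarity. The tensor shapes, function names, and scoring rule are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch (not the authors' code): score candidate refusal directions by
# cosine similarity in activation space, independent of output tokens.
import torch
import torch.nn.functional as F

def candidate_directions(harmful_acts, harmless_acts):
    """Difference-of-means direction per layer.

    harmful_acts, harmless_acts: [num_layers, num_prompts, hidden_dim]
    Returns: [num_layers, hidden_dim] unit-norm candidate directions.
    """
    diff = harmful_acts.mean(dim=1) - harmless_acts.mean(dim=1)
    return F.normalize(diff, dim=-1)

def score_direction(direction, harmful_acts, harmless_acts, layer):
    """Illustrative cosine-similarity score for one candidate.

    Ablate the candidate direction from harmful-prompt activations and measure
    how closely the result aligns with the harmless-prompt mean; a higher score
    suggests the direction captures the refusal-relevant difference.
    """
    h = harmful_acts[layer]                                   # [num_prompts, hidden_dim]
    ablated = h - (h @ direction).unsqueeze(-1) * direction   # remove the component along the direction
    target = harmless_acts[layer].mean(dim=0)
    return F.cosine_similarity(ablated.mean(dim=0), target, dim=0).item()

def select_direction(harmful_acts, harmless_acts):
    """Pick the (layer, direction) pair with the best cosine-similarity score."""
    dirs = candidate_directions(harmful_acts, harmless_acts)
    scores = [score_direction(dirs[l], harmful_acts, harmless_acts, l)
              for l in range(dirs.shape[0])]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, dirs[best]
```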
MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models
Jianhong Tu | Zhuohao Ni | Nicholas Crispino | Zihao Yu | Michael Bendersky | Beliz Gunel | Ruoxi Jia | Xin Liu | Lingjuan Lyu | Dawn Song | Chenguang Wang
Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM)
We present a novel visual instruction tuning strategy that improves the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work lacks sufficient experimentation on the importance of each modality in the instruction tuning stage, typically using a majority of vision-language data while keeping text-only data limited and the modality mixture fixed. By incorporating diverse text-only data into the visual instruction tuning stage and varying the amount of vision-language data in controlled experiments, we investigate the importance of each modality in visual instruction tuning. Our comprehensive evaluation shows that the text-heavy instruction tuning approach performs on par with traditional vision-heavy mixtures on both modalities across 12 general datasets while using as little as half the total training tokens. We find that simply increasing sufficiently diverse text-only data enables the transfer of instruction-following ability and domain knowledge across modalities while being more efficient than the vision-language approach.
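As a rough illustration of the text-heavy mixture idea described above, the sketch below samples a fixed-size instruction tuning set with a controllable share of vision-language examples. The function name, the 0.8 text fraction, and the sampling scheme are illustrative assumptions rather than the paper's recipe.

```python
# Hedged sketch (not the authors' code): build an instruction-tuning mixture in
# which text-only examples dominate and the vision-language share is a
# controlled variable.
import random

def build_mixture(text_only, vision_language, text_fraction=0.8, total=10_000, seed=0):
    """Sample a fixed-size instruction-tuning set with a chosen modality ratio."""
    rng = random.Random(seed)
    n_text = int(total * text_fraction)
    n_vl = total - n_text
    mixture = rng.sample(text_only, n_text) + rng.sample(vision_language, n_vl)
    rng.shuffle(mixture)
    return mixture
```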