LlmFixer: Fix the Helpfulness of Defensive Large Language Models
Zelong Yu, Xiaoming Zhang, Litian Zhang, Yu Yuan, Chaozhuo Li
Abstract
Defense strategies beyond alignment have been introduced to protect large language models against jailbreak attacks, and they have reduced the success rate of such attacks. However, these defense strategies also weaken the helpfulness of large language models. In this work, we propose a universal framework, LlmFixer, that acts on large language models equipped with any defense strategy to recover their original helpfulness. LlmFixer consists of an input prompt re-writer and a logic patch. The prompt re-writer is a pre-model that clarifies the intention of input prompts, which encourages large language models to be more helpful on benign inputs and more likely to reject malicious inputs. The logic patch is a lightweight structure that enhances large language models’ comprehension capacity by supplementing certain logical relationships. Without updating the parameters of a defensive large language model, LlmFixer restores its helpfulness while preserving safety. Experiments on three large language models, five jailbreak attacks, and four defense strategies show the effectiveness of LlmFixer.
- Anthology ID: 2025.findings-emnlp.989
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 18233–18247
- URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.989/
- DOI: 10.18653/v1/2025.findings-emnlp.989
- Cite (ACL): Zelong Yu, Xiaoming Zhang, Litian Zhang, Yu Yuan, and Chaozhuo Li. 2025. LlmFixer: Fix the Helpfulness of Defensive Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 18233–18247, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): LlmFixer: Fix the Helpfulness of Defensive Large Language Models (Yu et al., Findings 2025)
- PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.989.pdf