LlmFixer: Fix the Helpfulness of Defensive Large Language Models

Zelong Yu, Xiaoming Zhang, Litian Zhang, Yu Yuan, Chaozhuo Li


Abstract
Defense strategies beyond alignment have been introduced to protect large language models against jailbreak attacks, and they have reduced the success rate of such attacks. However, these defense strategies also weaken the helpfulness of large language models. In this work, we propose a universal framework, LlmFixer, which can be applied to large language models equipped with any defense strategy to recover their original helpfulness. LlmFixer consists of an input prompt re-writer and a logic patch. The prompt re-writer is a pre-model that clarifies the intention of input prompts, which encourages large language models to be more helpful on benign inputs and more likely to reject malicious inputs. The logic patch is a lightweight structure that enhances large language models’ comprehension capacity by supplementing certain logical relationships. Without updating the parameters of a defensive large language model, LlmFixer restores its helpfulness while preserving safety. Experiments on three large language models, five jailbreak attacks, and four defense strategies show the effectiveness of LlmFixer.
Anthology ID:
2025.findings-emnlp.989
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
18233–18247
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.989/
DOI:
10.18653/v1/2025.findings-emnlp.989
Cite (ACL):
Zelong Yu, Xiaoming Zhang, Litian Zhang, Yu Yuan, and Chaozhuo Li. 2025. LlmFixer: Fix the Helpfulness of Defensive Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 18233–18247, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
LlmFixer: Fix the Helpfulness of Defensive Large Language Models (Yu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.989.pdf
Checklist:
 2025.findings-emnlp.989.checklist.pdf