SHARP: Self-adaptive Harmful Category-aware Prompt Generation for Black-box Jailbreaking

Yingjie Xue, Xingyou Xia, Jun Zhang, Yunbo Cao, Dengpan Ye, Guotong Geng, Fei Li


Abstract
Large Language Models (LLMs) are widely applied in domains such as education and healthcare, making safety assurance crucial. Jailbreak attacks, a technique used in red-teaming, help evaluate and improve the defensive strategies of LLMs. However, existing jailbreak methods often overlook the semantic differences across categories of harmful questions, leading to inconsistent success rates and reduced overall attack effectiveness. We propose SHARP, the first category-aware jailbreak framework, which incorporates the semantic category of harmful questions into prompt generation. Trained on a verified jailbreak dataset, SHARP learns category-specific semantic features and adaptively generates prompts that bypass safety mechanisms. The method combines two-stage LoRA fine-tuning with DPO-based reinforcement learning to optimize both attack success and category alignment. Experiments show that SHARP significantly improves attack success rates and achieves better cross-category robustness than state-of-the-art (SOTA) baselines, providing an efficient and scalable tool for evaluating LLM safety.
Anthology ID:
2026.acl-long.2100
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
45291–45303
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2100/
Cite (ACL):
Yingjie Xue, Xingyou Xia, Jun Zhang, Yunbo Cao, Dengpan Ye, Guotong Geng, and Fei Li. 2026. SHARP: Self-adaptive Harmful Category-aware Prompt Generation for Black-box Jailbreaking. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 45291–45303, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
SHARP: Self-adaptive Harmful Category-aware Prompt Generation for Black-box Jailbreaking (Xue et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2100.pdf
Checklist:
2026.acl-long.2100.checklist.pdf