Zenghao Duan


2025

from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors
Yu Yan | Sheng Sun | Zenghao Duan | Teli Liu | Min Liu | Zhiyi Yin | Jingyu Lei | Qi Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Current studies have exposed the risk of Large Language Models (LLMs) generating harmful content under jailbreak attacks. However, they overlook that directly generating harmful content from scratch is more difficult than inducing the LLM to calibrate benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason about and calibrate the metaphorical content, and is thus jailbroken either by directly outputting harmful responses or by calibrating the residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can effectively and transferably jailbreak LLMs, achieving a state-of-the-art attack success rate across multiple advanced LLMs.

Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs
Yu Yan | Sheng Sun | Zhe Wang | Yijun Lin | Zenghao Duan | Zhifei Zheng | Min Liu | Zhiyi Yin | Jianping Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025

With the development of Large Language Models (LLMs), numerous efforts have revealed their vulnerabilities to jailbreak attacks. Although these studies have driven progress in LLMs’ safety alignment, it remains unclear whether LLMs have internalized authentic knowledge for dealing with real-world crimes or are merely forced to simulate toxic language patterns. This ambiguity raises the concern that jailbreak success is often attributable to a hallucination loop between the jailbroken LLM and the judge LLM. By decoupling the use of jailbreak techniques, we construct knowledge-intensive Q&A tasks to investigate the misuse threats of LLMs in terms of dangerous knowledge possession, harmful task planning utility, and harmfulness judgment robustness. Experiments reveal a mismatch between jailbreak success rates and harmful knowledge possession in LLMs, and show that existing LLM-as-a-judge frameworks tend to anchor harmfulness judgments on toxic language patterns. Our study exposes a gap between existing LLM safety assessments and real-world threat potential.

Related Knowledge Perturbation Matters: Rethinking Multiple Pieces of Knowledge Editing in Same-Subject
Zenghao Duan | Wenbin Duan | Zhiyi Yin | Yinghan Shen | Shaoling Jing | Jie Zhang | Huawei Shen | Xueqi Cheng
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)