2025
from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors
Yu Yan | Sheng Sun | Zenghao Duan | Teli Liu | Min Liu | Zhiyi Yin | Jingyu Lei | Qi Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Current studies have exposed the risk of Large Language Models (LLMs) generating harmful content via jailbreak attacks. However, they overlook that directly generating harmful content from scratch is more difficult for an LLM than calibrating benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Driven by these metaphors, the target LLM is then induced to reason about and calibrate the metaphorical content, and is thus jailbroken either by directly outputting harmful responses or by calibrating the residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can effectively and transferably jailbreak LLMs, achieving state-of-the-art attack success rates across multiple advanced LLMs.
PRDetect: Perturbation-Robust LLM-generated Text Detection Based on Syntax Tree
Xiang Li | Zhiyi Yin | Hexiang Tan | Shaoling Jing | Du Su | Yi Cheng | Huawei Shen | Fei Sun
Findings of the Association for Computational Linguistics: NAACL 2025
As LLM-generated text, which often contains hallucinations or biases, becomes increasingly prevalent on the internet, detecting such content has emerged as a critical area of research. Recent methods demonstrate impressive performance in detecting text generated entirely by LLMs. In real-world scenarios, however, users often introduce perturbations to LLM-generated text, and the robustness of existing detection methods against these perturbations has not been sufficiently explored. This paper empirically investigates this challenge and finds that even minor perturbations can severely degrade the performance of current detection methods. To address this issue, we observe that the syntax tree is minimally affected by such disturbances and exhibits distinct differences between human-written and LLM-generated text. We therefore propose a detection method based on syntax trees, which captures features invariant to perturbations. It demonstrates significantly improved robustness against perturbation on the HC3 and GPT-3.5-mixed datasets, while also incurring the lowest time cost among the compared methods. We provide the code and data at https://github.com/thulx18/PRDetect.
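To make the idea above concrete, here is a minimal sketch of syntax-based detection, assuming spaCy (with the en_core_web_sm parser) and scikit-learn. The feature design (part-of-speech plus dependency-label n-grams) and the linear classifier are illustrative assumptions made here, not the authors' implementation; the actual method and data are in the linked repository.

import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumed parser; any dependency parser would do.
nlp = spacy.load("en_core_web_sm")

def syntax_signature(text: str) -> str:
    # Flatten the dependency parse into a (POS, dependency-label) sequence.
    # Surface perturbations such as typos or synonym swaps usually keep a
    # token's part of speech and syntactic slot, so this sequence is stable.
    doc = nlp(text)
    return " ".join(f"{tok.pos_}_{tok.dep_}" for tok in doc)

# Bag of syntax n-grams over the flattened trees, then a linear classifier.
detector = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)

# Usage with hypothetical data: `texts` is a list of strings, `labels` is
# 1 for LLM-generated and 0 for human-written.
# detector.fit([syntax_signature(t) for t in texts], labels)
# detector.predict([syntax_signature(suspect_text)])

The design rests on the abstract's observation: a typo or synonym swap usually preserves a token's part of speech and its slot in the dependency tree, so the flattened signature changes far less than the surface text does.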
Related Knowledge Perturbation Matters: Rethinking Multiple Pieces of Knowledge Editing in Same-Subject
Zenghao Duan | Wenbin Duan | Zhiyi Yin | Yinghan Shen | Shaoling Jing | Jie Zhang | Huawei Shen | Xueqi Cheng
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)