Yongmei Zhou
Also published as: 周咏梅
2025
Jailbreaking? One Step Is Enough!
Weixiong Zheng | Peijian Zeng | YiWei Li | Hongyan Wu | Nankai Lin | Junhao Chen | Aimin Yang | Yongmei Zhou
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) excel at various tasks but remain vulnerable to jailbreak attacks, in which adversaries manipulate prompts to elicit harmful outputs. Examining jailbreak prompts helps uncover the shortcomings of LLMs. However, current jailbreak methods and the target model’s defenses are engaged in an independent and adversarial process, resulting in the need for frequent attack iterations and redesigned attacks for different models. To address these gaps, we propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as a “defense” intention against harmful content. Specifically, REDA starts from the target response, guiding the model to embed harmful content within its defensive measures, thereby relegating harmful content to a secondary role and making the model believe it is performing a defensive task. The attacking model considers that it is guiding the target model to deal with harmful content, while the target model thinks it is performing a defensive task, creating an illusion of cooperation between the two. Additionally, to enhance the model’s confidence and guidance in “defensive” intentions, we adopt in-context learning (ICL) with a small number of attack examples and construct a corresponding dataset of attack examples. Extensive evaluations demonstrate that the REDA method enables cross-model attacks without redesigning attack strategies for different models, achieves successful jailbreaks in a single iteration, and outperforms existing methods on both open-source and closed-source models.
2022
基于关系图注意力网络和宽度学习的负面情绪识别方法 (Negative Emotion Recognition Method Based on Relational Graph Attention Network and Broad Learning)
Sancheng Peng (彭三城) | Guanghao Chen (陈广豪) | Lihong Cao (曹丽红) | Rong Zeng (曾嵘) | Yongmei Zhou (周咏梅) | Xinguang Li (李心广)
Proceedings of the 21st Chinese National Conference on Computational Linguistics
Negative emotion recognition in dialogue text aims to identify the negative emotion of each utterance in a conversation, and it has become a research hotspot in recent years. However, recognizing negative emotions in dialogue is a challenging task for machines, because emotional expression in conversation typically depends on context. To address this problem, this paper proposes a negative emotion recognition method for dialogue text based on a Relational Graph Attention Network (RGAT) and Broad Learning (BL), called RGAT-BL. The method uses the pre-trained model RoBERTa to generate initial vectors for the dialogue text; Bi-LSTM then extracts local and contextual semantic features from these vectors to obtain utterance-level features; RGAT extracts long-distance dependencies between speakers to obtain speaker-level features; and BL processes the concatenation of the two kinds of features to produce the final negative-emotion classification. Comparative experiments against baseline models on three datasets show that the proposed method outperforms the baselines in weighted-F1 and macro-F1 on all three datasets.
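The abstract outlines a four-stage architecture: RoBERTa utterance embeddings, a Bi-LSTM for utterance-level features, an RGAT for speaker-level dependencies, and a broad learning classifier over the concatenated features. A minimal PyTorch sketch of that pipeline might look as follows; the class names, dimensions, the simplified per-relation attention, and the ridge-regression stand-in for the broad learning system are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the RGAT-BL pipeline described in the abstract.
# Dimensions, the per-relation attention, and the ridge-regression
# stand-in for the broad learning head are assumptions for illustration.
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Bi-LSTM over per-utterance RoBERTa embeddings -> utterance-level features."""
    def __init__(self, d_in: int = 768, d_hid: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(d_in, d_hid, batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_utterances, d_in) -> (batch, n_utterances, 2 * d_hid)
        out, _ = self.lstm(x)
        return out

class SimpleRGAT(nn.Module):
    """One relational graph-attention layer: a separate attention module per
    edge type (e.g. same-speaker vs. cross-speaker edges), summed afterwards."""
    def __init__(self, d: int, n_relations: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.MultiheadAttention(d, num_heads=1, batch_first=True)
             for _ in range(n_relations)]
        )

    def forward(self, h: torch.Tensor, rel_masks: list[torch.Tensor]) -> torch.Tensor:
        # rel_masks[r]: (n, n) bool, True where utterance i may attend to j
        # under relation r; every row should at least allow self-attention.
        outs = []
        for attn, allowed in zip(self.heads, rel_masks):
            o, _ = attn(h, h, h, attn_mask=~allowed)  # True in attn_mask = blocked
            outs.append(o)
        return torch.stack(outs).sum(dim=0)

def broad_learning_head(feats: torch.Tensor, labels: torch.Tensor,
                        reg: float = 1e-3) -> torch.Tensor:
    """Closed-form ridge regression standing in for the broad learning system:
    W = (F^T F + reg * I)^{-1} F^T Y, fit on the concatenated features."""
    d = feats.shape[1]
    return torch.linalg.solve(
        feats.T @ feats + reg * torch.eye(d), feats.T @ labels
    )
```

In this sketch, the Bi-LSTM output (utterance-level) and the RGAT output (speaker-level) would be concatenated per utterance and passed to broad_learning_head, mirroring the feature-fusion step the abstract describes.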