Cong Gao
2025
Dynamic Evil Score-Guided Decoding: An Efficient Decoding Framework For Red-Team Model
Cong Gao | Bo Zhang | Linkang Yang | Minghao Hu | Zhunchen Luo | Xiaoying Bai | Guotong Geng | Jun Zhang | Yunhua Xue
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) have achieved significant advances but can potentially generate harmful content such as social biases, extremism, and misinformation. Red teaming is a promising approach to enhance model safety by creating adversarial prompts to test and improve model robustness. However, existing red-teaming methods often require expensive fine-tuning, especially for large LLMs. We propose the Dynamic Evil Score-Guided Decoding framework (DESGD), an efficient red-teaming method whose computational cost does not grow with the target model size. DESGD introduces the concept of an ‘evil score’ to dynamically evaluate the potential of tokens to contribute to harmful outputs during decoding. The framework constructs a small unsafe model from an adversarial dataset and adjusts the logits vector of the target model based on the evil score. Experiments show that DESGD achieves an ASR of 92.83% on the Llama-3.2-3B-Instruct model, compared to 83.48% with adversarial fine-tuning, while using fewer computational resources. Similarly, on the Qwen2.5-3B-Instruct model, DESGD reaches an ASR of 88.62%, outperforming adversarial fine-tuning (77.56%).
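The abstract describes guiding the target model's decoding by adjusting its logits with an 'evil score' computed from a small auxiliary model. A minimal sketch of this idea, assuming the evil score is the logit gap between a small unsafe model and a small safe reference model (the function names, the `alpha` weight, and this particular score definition are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def evil_guided_logits(target_logits, unsafe_logits, safe_logits, alpha=1.0):
    """Shift the target model's logits toward tokens the small unsafe
    model prefers over its safe counterpart (a hypothetical evil score)."""
    evil_score = unsafe_logits - safe_logits
    return target_logits + alpha * evil_score

def softmax(x):
    # Numerically stable softmax over a 1-D logits vector.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Toy vocabulary of 5 tokens with random logits standing in for the
# three models' outputs at one decoding step.
rng = np.random.default_rng(0)
target = rng.normal(size=5)
unsafe = rng.normal(size=5)
safe = rng.normal(size=5)

probs = softmax(evil_guided_logits(target, unsafe, safe, alpha=0.5))
next_token = int(np.argmax(probs))  # greedy pick of the guided distribution
```

With `alpha=0` the guided distribution reduces to the target model's own distribution, so `alpha` controls how strongly the small model steers decoding without touching the target model's weights.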
2019
Learning to Learn Sales Prediction with Social Media Sentiment
Zhaojiang Lin | Andrea Madotto | Genta Indra Winata | Zihan Liu | Yan Xu | Cong Gao | Pascale Fung
Proceedings of the First Workshop on Financial Technology and Natural Language Processing