2025
pdf
bib
abs
Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data
Shenglai Zeng
|
Jiankun Zhang
|
Pengfei He
|
Jie Ren
|
Tianqi Zheng
|
Hanqing Lu
|
Han Xu
|
Hui Liu
|
Yue Xing
|
Jiliang Tang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources. However, when the retrieval process involves private data, RAG systems may face severe privacy risks, potentially leading to the leakage of sensitive information. To address this issue, we propose using synthetic data as a privacy-preserving alternative for the retrieval data. We propose SAGE, a novel two-stage synthetic data generation paradigm. In the stage-1, we employ an attribute-based extraction and generation approach to preserve key contextual information from the original data. In the stage-2, we further enhance the privacy properties of the synthetic data through an agent-based iterative refinement process. Extensive experiments demonstrate that using our synthetic data as the retrieval context achieves comparable performance to using the original data while substantially reducing privacy risks. Our work takes the first step towards investigating the possibility of generating high-utility and privacy-preserving synthetic data for RAG, opening up new opportunities for the safe application of RAG systems in various domains.
pdf
bib
abs
Data Poisoning for In-context Learning
Pengfei He
|
Han Xu
|
Yue Xing
|
Hui Liu
|
Makoto Yamada
|
Jiliang Tang
Findings of the Association for Computational Linguistics: NAACL 2025
In-context learning (ICL) has emerged as a capability of large language models (LLMs), enabling them to adapt to new tasks using provided examples. While ICL has demonstrated its strong effectiveness, there is limited understanding of its vulnerability against potential threats. This paper examines ICL’s vulnerability to data poisoning attacks. We introduce ICLPoison, an attacking method specially designed to exploit ICL’s unique learning mechanisms by identifying discrete text perturbations that influence LLM hidden states. We propose three representative attack strategies, evaluated across various models and tasks. Our experiments, including those on GPT-4, show that ICL performance can be significantly compromised by these attacks, highlighting the urgent need for improved defense mechanisms to protect LLMs’ integrity and reliability.
pdf
bib
abs
Red-Teaming LLM Multi-Agent Systems via Communication Attacks
Pengfei He
|
Yuping Lin
|
Shen Dong
|
Han Xu
|
Yue Xing
|
Hui Liu
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Model-based Multi-Agent Systems (LLM-MAS) have revolutionized complex problem-solving capability by enabling sophisticated agent collaboration through message-based communications. While the communication framework is crucial for agent coordination, it also introduces a critical yet unexplored security vulnerability. In this work, we introduce Agent-in-the-Middle (AiTM), a novel attack that exploits the fundamental communication mechanisms in LLM-MAS by intercepting and manipulating inter-agent messages. Unlike existing attacks that compromise individual agents, AiTM demonstrates how an adversary can compromise entire multi-agent systems by only manipulating the messages passing between agents. To enable the attack under the challenges of limited control and role-restricted communication format, we develop an LLM-powered adversarial agent with a reflection mechanism that generates contextually-aware malicious instructions. Our comprehensive evaluation across various frameworks, communication structures, and real-world applications demonstrates that LLM-MAS is vulnerable to communication-based attacks, highlighting the need for robust security measures in multi-agent systems.
pdf
bib
abs
Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?
Yang Nan
|
Pengfei He
|
Ravi Tandon
|
Han Xu
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) have delivered significant breakthroughs across diverse domains but can still produce unreliable or misleading outputs, posing critical challenges for real-world applications. While many recent studies focus on quantifying model uncertainty, relatively little work has been devoted to diagnosing the source of uncertainty. In this study, we show that, when an LLM is uncertain, the patterns of disagreement among its multiple generated responses contain rich clues about the underlying cause of uncertainty. To illustrate this point, we collect multiple responses from a target LLM and employ an auxiliary LLM to analyze their patterns of disagreement. The auxiliary model is tasked to reason about the likely source of uncertainty, such as whether it stems from ambiguity in the input question, a lack of relevant knowledge, or both. In cases involving knowledge gaps, the auxiliary model also identifies the specific missing facts or concepts contributing to the uncertainty. In our experiment, we validate our framework on AmbigQA, OpenBookQA, and MMLU-Pro, confirming its generality in diagnosing distinct uncertainty sources. Such diagnosis shows the potential for relevant manual interventions that improve LLM performance and reliability.
2024
pdf
bib
abs
Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
Yuping Lin
|
Pengfei He
|
Han Xu
|
Yue Xing
|
Makoto Yamada
|
Hui Liu
|
Jiliang Tang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs to output harmful contents. Although there are diverse jailbreak attack strategies, there is no unified understanding on why some methods succeed and others fail. This paper explores the behavior of harmful and harmless prompts in the LLM’s representation space to investigate the intrinsic properties of successful jailbreak attacks. We hypothesize that successful attacks share some similar properties: They are effective in moving the representation of the harmful prompt towards the direction to the harmless prompts. We leverage hidden representations into the objective of existing jailbreak attacks to move the attacks along the acceptance direction, and conduct experiments to validate the above hypothesis using the proposed objective. We hope this study provides new insights into understanding how LLMs understand harmfulness information.
pdf
bib
abs
The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)
Shenglai Zeng
|
Jiankun Zhang
|
Pengfei He
|
Yiding Liu
|
Yue Xing
|
Han Xu
|
Jie Ren
|
Yi Chang
|
Shuaiqiang Wang
|
Dawei Yin
|
Jiliang Tang
Findings of the Association for Computational Linguistics: ACL 2024
Retrieval-augmented generation (RAG) is a powerful technique to facilitate language model generation with proprietary and private data, where data privacy is a pivotal concern. Whereas extensive research has demonstrated the privacy risks of large language models (LLMs), the RAG technique could potentially reshape the inherent behaviors of LLM generation, posing new privacy issues that are currently under-explored. To this end, we conduct extensive empirical studies with novel attack methods, which demonstrate the vulnerability of RAG systems on leaking the private retrieval database. Despite the new risks brought by RAG on the retrieval data, we further discover that RAG can be used to mitigate the old risks, i.e., the leakage of the LLMs’ training data. In general, we reveal many new insights in this paper for privacy protection of retrieval-augmented LLMs, which could benefit both LLMs and RAG systems builders.
pdf
bib
abs
On the Generalization of Training-based ChatGPT Detection Methods
Han Xu
|
Jie Ren
|
Pengfei He
|
Shenglai Zeng
|
Yingqian Cui
|
Amy Liu
|
Hui Liu
|
Jiliang Tang
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models, such as ChatGPT, achieve amazing performance on various language processing tasks. However, they can also be exploited for improper purposes such as plagiarism or misinformation dissemination. Thus, there is an urgent need to detect the texts generated by LLMs. One type of most studied methods trains classification models to distinguish LLM texts from human texts. However, existing studies demonstrate the trained models may suffer from distribution shifts (during test), i.e., they are ineffective to predict the generated texts from unseen language tasks or topics which are not collected during training. In this work, we focus on ChatGPT as a representative model, and we conduct a comprehensive investigation on these methods’ generalization behaviors under distribution shift caused by a wide range of factors, including prompts, text lengths, topics, and language tasks. To achieve this goal, we first collect a new dataset with human and ChatGPT texts, and then we conduct extensive studies on the collected dataset. Our studies unveil insightful findings that provide guidance for future methodologies and data collection strategies for LLM detection.