Michael Backes - ACL Anthology

Michael Backes

2026

Peering Behind the Shield: Guardrail Identification in Large Language Models
Ziqing Yang | Yixin Wu | Rui Wen | Michael Backes | Yang Zhang
Findings of the Association for Computational Linguistics: ACL 2026

With the rapid adoption of large language models (LLMs), conversational AI agents have become widely deployed across real-world applications. To enhance safety, these agents are often equipped with guardrails that moderate harmful content. Identifying the guardrails in an agent thus becomes critical for adversaries to understand the system and design guard-specific attacks. In this work, we introduce AP-Test, a novel approach that leverages guard-specific adversarial prompts to detect the identity of guardrails deployed in black-box AI agents. Our method addresses key challenges in this task, including the influence of safety-aligned LLMs and other guardrails, as well as a lack of principled decision-making strategies. AP-Test employs two complementary testing strategies, input and output guard tests, and a new metric, match score, to enable robust identification. Experiments across diverse agents and four open-source guardrails demonstrate that AP-Test achieves perfect classification accuracy in multiple scenarios. Ablation studies further highlight the necessity of our proposed components. Our findings reveal a practical path toward guardrail identification in real-world AI systems.

PeerCheck: Enhancing LLM-Generated Academic Reviews Towards Human-Level Quality
Zeyuan Chen | Ziqing Yang | Yihan Ma | Michael Backes | Yang Zhang
Findings of the Association for Computational Linguistics: ACL 2026

As academic submissions grow, the traditional peer review process struggles to keep up, raising concerns about quality and fairness. A trend of using large language models (LLMs) for assistance has emerged. In this work, we take a critical step toward improving the quality of LLM-generated reviews. We propose the PeerCheck framework, which investigates LLM-human review differences (RQ1) and explores methods to increase LLM-human similarity (RQ2). We first compare human-written reviews with reviews generated by GPT-4o, Claude-3.7-Sonnet, and DeepSeek-V3 and find that LLMs and humans focus on different terms, e.g., LLMs prioritize theory while humans emphasize methodology and experiments. We further adopt prompt engineering, such as Chain-of-Thought (CoT), and utilize retrieval-augmented generation (RAG) to enhance the LLM-generated reviews towards human-level quality. We find CoT significantly improves the human similarity of LLM reviews, while we also discover an unexpected “RAG paradox,” i.e., experiments with RAG produce different results for various LLMs and, in some cases, even reduce review quality. Our comprehensive analysis of LLM-generated academic reviews illustrates both possibilities and limitations, contributing to a more effective, human-aligned review system.

Rethinking Assessments of Prompt Injection Attacks
Chi Cui | Yixin Wu | Michael Backes | Yang Zhang
Findings of the Association for Computational Linguistics: ACL 2026

Prompt injection attacks are recognized as one of the primary risks faced by LLM-integrated applications in recent years. However, common evaluation frameworks remain insufficient, lacking comprehensiveness and real-world relevance. To bridge this gap, we revisit the common evaluation framework and conduct an extensive evaluation across eight different evaluation settings, including 37 real-world applications, 185 injected tasks, 21 attack instructions, and a total of 143,745 queries. The evaluation highlights several findings. For example, real-world applications are more vulnerable to prompt injection attacks compared to those used in research settings. While complex attack instructions are more sophisticated, they are less effective than simple attack instructions. We further conduct an assessment of both prompt-level and model-level defense mechanisms and highlight their limitations in real-world applications. By exploring more diverse scenarios across different dimensions, our framework provides a solid foundation for assessing vulnerabilities in LLM-integrated applications and evaluating the efficacy of defensive strategies.

InferPilot: Autonomous Inference Attacks Against ML Services With LLM-Based Agents
Yixin Wu | Rui Wen | Chi Cui | Michael Backes | Yang Zhang
Findings of the Association for Computational Linguistics: ACL 2026

Inference attacks have been widely studied and offer a systematic risk assessment of ML services; however, their implementation and the attack parameters for optimal estimation are challenging for non-experts. The emergence of advanced large language models presents a promising yet largely unexplored opportunity to develop autonomous agents as inference attack experts, helping address this challenge. In this paper, we propose InferPilot, an autonomous agent capable of independently conducting inference attacks without human intervention. We evaluate it on 20 target services. The evaluation shows that our agent, using GPT-4o, achieves a 100.0% task completion rate and near-expert attack performance, with an average token cost of only 0.627 per run. The agent can also be powered by many other representative LLMs and can adaptively optimize its strategy under service constraints. We further perform trace analysis, demonstrating that design choices, such as a multi-agent framework and task-specific action spaces, effectively mitigate errors such as bad plans, inability to follow instructions, task context loss, and hallucinations. We anticipate that such agents could empower non-expert ML service providers, auditors, or regulators to systematically assess the risks of ML services without requiring deep domain expertise.

Reward Yourself: Efficient Self Rewards for Trustworthy Sampling
Mingjie Li | Wai Man Si | Michael Backes | Yang Zhang
Findings of the Association for Computational Linguistics: ACL 2026

As high-quality data becomes harder to obtain, reward models are increasingly important. Beyond the costly RLHF stage, they are now used at inference time to guide LLM generation and in data selection for post-training. These methods bring efficiency and performance gains, but current reward models often fail to prevent untrustworthy behaviors such as privacy leaks and stereotypes. Re-training reward models to address these issues is expensive, since it requires large-scale human preference data. We propose SelfRW, a lightweight intrinsic reward that needs no extra fine-tuning or auxiliary models. By pruning current LLMs to approximate a “trust” and an “untrust” token distribution, we compute the log-probability difference as an auxiliary reward. When integrated into reward-guided sampling, SelfRW significantly reduces untrustworthy outputs while preserving task performance. It also improves reward-guided data selection, yielding better post-trained models. Experiments with two reward models and four LLMs on privacy, bias, and stereotype benchmarks show that combining SelfRW consistently improves trustworthiness (over 10% in privacy tasks and 20% in bias tasks) with minimal impact on general utility benchmarks.

DE-CLIP: Few-Shot Anomaly Detection via Difference-Guided Embedding Editing
Yage Zhang | Yukun Jiang | Michael Backes | Yang Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Anomaly detection (AD) plays a critical role in applications such as automated industrial inspection and medical image analysis. Empowered by the strong pre-trained vision-language model, CLIP, recent years have witnessed the emergence of several CLIP-based few-shot AD methods. Due to the overlap between the embedding distributions of normal and anomalous samples, many existing approaches introduce additional model training for more discriminative text embeddings. However, we demonstrate that such training is not necessary. Specifically, we find that this embedding overlap can be separated by introducing a Difference-guided vector for embedding Editing (DiffEdit). Based on this finding, we propose DE-CLIP, a simple yet effective framework based on DiffEdit, which directly edits text embeddings based on the textual and visual differences between normal and anomalous samples, resulting in more discriminative embeddings for AD. Extensive experiments on industrial and medical datasets demonstrate the superiority of our proposed DE-CLIP compared with existing baselines. For instance, on the MVTec dataset, DE-CLIP achieves 96.6% and 96.7% AUROC on anomaly classification and segmentation, surpassing both training-based and training-free methods. In addition, we observe that introducing DiffEdit into other training-free baselines could also significantly improve their performance, highlighting the potential of DiffEdit to promote better AD.

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?
Yuan Xin | Dingfan Chen | Linyi Yang | Michael Backes | Xiao Zhang
Findings of the Association for Computational Linguistics: ACL 2026

As large language models (LLMs) are increasingly deployed, ensuring their safe use is paramount. Jailbreaks, adversarial prompts that bypass model alignment to trigger harmful outputs, present significant risks, with existing studies reporting high success rates in evading common LLMs. However, previous evaluations have focused solely on the models, neglecting the full deployment pipeline, which typically incorporates additional safety mechanisms like content moderation filters. To address this gap, we present a systematic evaluation of jailbreak attacks targeting LLM safety alignment, assessing their success across the full inference pipeline, including both input and output filtering stages. Our findings yield two key insights: first, nearly all evaluated jailbreak techniques can be detected by at least one safety filter, suggesting that prior assessments may have overestimated the practical success of these attacks; second, while safety filters are effective in detection, there remains room to better balance recall and precision to further optimize protection and user experience. We highlight critical gaps and call for further refinement of detection accuracy and usability in LLM safety systems.

Open Schrödinger’s Closed Box: Identifying Retrieval Augmented Generation in API-Accessible Large Language Model Services
Yukun Jiang | Xinyue Shen | Michael Backes | Zheng Li | Yang Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) are powerful at question-answering but prone to hallucinations due to limited domain-specific or up-to-date knowledge. Retrieval augmented generation (RAG) mitigates this by adding an external retriever and knowledge database, yet RAG remains vulnerable to targeted attacks that degrade outputs or manipulate opinions. Prior attacks typically assume adversaries know the service is RAG-enhanced and may even know deployment details, an assumption often invalid for real-world commercial LLMs that expose only black-box APIs. This opacity also risks misleading users about system capabilities. This work aims to bridge this gap by proposing RAG-ID, a framework for IDentifying RAG properties in LLM services. We classify adversaries into three knowledge levels and design six attack methods. Experiments show these attacks reliably detect RAG, with up to 99.97% accuracy with partial or no optional knowledge, and nearly 100% when the LLM and database are known. After detection, RAG-ID can infer finer RAG properties (e.g., deployed LLM and knowledge database). We consider RAG-ID a reconnaissance tool for attackers, a way to facilitate users’ transparent selection of LLM services, and a guide for RAG developers in refining security measures.

Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
Wai Man Si | Mingjie Li | Michael Backes | Yang Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain “unsafe tickets” responsible for harmful behaviors, and pruning reveals “safety tickets” that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resource-constrained settings.

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
Rui Zhang | Hongwei Li | Yun Shen | Xinyue Shen | Wenbo Jiang | Guowen Xu | Yang Liu | Michael Backes | Yang Zhang
Findings of the Association for Computational Linguistics: ACL 2026

The deployment of large language models (LLMs) raises significant ethical and safety concerns. While LLM alignment techniques are adopted to improve model safety and trustworthiness, adversaries can exploit these techniques to undermine safety for malicious purposes, resulting in misalignment. Misaligned LLMs may be published on open platforms to magnify harm. To address this, additional safety alignment, referred to as realignment, is necessary before deploying untrusted third-party LLMs. This study explores the efficacy of fine-tuning methods in terms of misalignment, realignment, and the effects of their interplay. By evaluating four Supervised Fine-Tuning (SFT) and two Preference Fine-Tuning (PFT) methods across four popular safety-aligned LLMs, we reveal a mechanism asymmetry between attack and defense. While Odds Ratio Preference Optimization (ORPO) is most effective for misalignment, Direct Preference Optimization (DPO) excels in realignment, albeit at the expense of model utility. Additionally, we identify model-specific resistance, residual effects of multi-round adversarial dynamics, and other noteworthy findings. These findings highlight the need for robust safeguards and customized safety alignment strategies to mitigate potential risks in the deployment of LLMs.

2025

JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
Junjie Chu | Yugeng Liu | Ziqing Yang | Xinyue Shen | Michael Backes | Yang Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Jailbreak attacks aim to bypass the LLMs’ safeguards. While researchers have proposed different jailbreak attacks in depth, they have done so in isolation—either with unaligned settings or comparing a limited range of methods. To fill this gap, we present a large-scale evaluation of various jailbreak attacks. We collect 17 representative jailbreak attacks, summarize their features, and establish a novel jailbreak attack taxonomy. Then we conduct comprehensive measurement and ablation studies across nine aligned LLMs on 160 forbidden questions from 16 violation categories. Also, we test jailbreak attacks under eight advanced defenses. Based on our taxonomy and experiments, we identify some important patterns, such as heuristic-based attacks, which could achieve high attack success rates but are easy to mitigate by defenses. Our study offers valuable insights for future research on jailbreak attacks and defenses and serves as a benchmark tool for researchers and practitioners to evaluate them effectively.

When GPT Spills the Tea: Comprehensive Assessment of Knowledge File Leakage in GPTs
Xinyue Shen | Yun Shen | Michael Backes | Yang Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Knowledge files have been widely used in large language model (LLM)-powered agents, such as GPTs, to improve response quality. However, concerns over the potential leakage of knowledge files have grown significantly. Existing studies demonstrate that adversarial prompts can induce GPTs to leak knowledge file content. Yet, it remains uncertain whether additional leakage vectors exist, particularly given the complex data flows across clients, servers, and databases in GPTs. In this paper, we present a comprehensive risk assessment of knowledge file leakage, leveraging a novel workflow inspired by Data Security Posture Management (DSPM). Through the analysis of 651,022 GPT metadata, 11,820 flows, and 1,466 responses, we identify five leakage vectors: metadata, GPT initialization, retrieval, sandboxed execution environments, and prompts. These vectors enable adversaries to extract sensitive knowledge file data such as titles, content, types, and sizes. Notably, the activation of the built-in tool Code Interpreter leads to a privilege escalation vulnerability, enabling adversaries to directly download original knowledge files with a 95.95% success rate. Further analysis reveals that 28.80% of leaked files are copyrighted, including digital copies from major publishers and internal materials from a listed company. In the end, we provide actionable solutions for GPT builders and platform providers to secure the GPT data supply chain.

Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media
Zhen Sun | Zongmin Zhang | Xinyue Shen | Ziyi Zhang | Yule Liu | Michael Backes | Yang Zhang | Xinlei He
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Social media platforms are experiencing a growing presence of AI-Generated Texts (AIGTs). However, the misuse of AIGTs could have profound implications for public opinion, such as spreading misinformation and manipulating narratives. Despite its importance, it remains unclear how prevalent AIGTs are on social media. To address this gap, this paper aims to quantify and monitor the AIGTs on online social media platforms. We first collect a dataset (SM-D) with around 2.4M posts from 3 major social media platforms: Medium, Quora, and Reddit. Then, we construct a diverse dataset (AIGTBench) to train and evaluate AIGT detectors. AIGTBench combines popular open-source datasets and our AIGT datasets generated from social media texts by 12 LLMs, serving as a benchmark for evaluating mainstream detectors. With this setup, we identify the best-performing detector (OSM-Det). We then apply OSM-Det to SM-D to track AIGTs across social media platforms from January 2022 to October 2024, using the AI Attribution Rate (AAR) as the metric. Specifically, Medium and Quora exhibit marked increases in AAR, rising from 1.77% to 37.03% and 2.06% to 38.95%, respectively. In contrast, Reddit shows slower growth, with AAR increasing from 1.31% to 2.45% over the same period. Our further analysis indicates that AIGTs on social media differ from human-written texts across several dimensions, including linguistic patterns, topic distributions, engagement levels, and the follower distribution of authors. We envision our analysis and findings on AIGTs in social media can shed light on future research in this domain.

Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification
Boyang Zhang | Yicong Tan | Yun Shen | Ahmed Salem | Michael Backes | Savvas Zannettou | Yang Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recently, autonomous agents built on large language models (LLMs) have experienced significant development and are being deployed in real-world applications. Through the usage of tools, these systems can perform actions in the real world. Given the agents’ practical applications and ability to execute consequential actions, such autonomous systems can cause more severe damage than a standalone LLM if compromised. While some existing research has explored harmful actions by LLM agents, our study approaches the vulnerability from a different perspective. We introduce a new type of attack that causes malfunctions by misleading the agent into executing repetitive or irrelevant actions. Our experiments reveal that these attacks can induce failure rates exceeding 80% in multiple scenarios. Through attacks on implemented and deployable agents in multi-agent scenarios, we accentuate the realistic risks associated with these vulnerabilities. To mitigate such attacks, we propose self-examination defense methods. Our findings indicate these attacks are more difficult to detect compared to previous overtly harmful attacks, highlighting the substantial risks associated with this vulnerability.

2024

Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models
Junjie Chu | Zeyang Sha | Michael Backes | Yang Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Significant advancements have recently been made in large language models, represented by GPT models. Users frequently have multi-round private conversations with cloud-hosted GPT models for task optimization. Yet, this operational paradigm introduces additional attack surfaces, particularly in custom GPTs and hijacked chat sessions. In this paper, we introduce a straightforward yet potent Conversation Reconstruction Attack. This attack targets the contents of previous conversations between GPT models and benign users, i.e., the benign users’ input contents during their interaction with GPT models. The adversary could induce GPT models to leak such contents by querying them with designed malicious prompts. Our comprehensive examination of privacy risks during the interactions with GPT models under this attack reveals GPT-4’s considerable resilience. We present two advanced attacks targeting improved reconstruction of past conversations, demonstrating significant privacy leakage across all models under these advanced techniques. Evaluating various defense mechanisms, we find them ineffective against these attacks. Our findings highlight the ease with which privacy can be compromised in interactions with GPT models, urging the community to safeguard against potential abuses of these models’ capabilities.

Composite Backdoor Attacks Against Large Language Models
Hai Huang | Zhengyu Zhao | Michael Backes | Yun Shen | Yang Zhang
Findings of the Association for Computational Linguistics: NAACL 2024

Large language models (LLMs) have demonstrated superior performance compared to previous methods on various tasks, and often serve as the foundation models for much research and many services. However, untrustworthy third-party LLMs may covertly introduce vulnerabilities for downstream tasks. In this paper, we explore the vulnerability of LLMs through the lens of backdoor attacks. Different from existing backdoor attacks against LLMs, ours scatters multiple trigger keys in different prompt components. Such a Composite Backdoor Attack (CBA) is shown to be stealthier than implanting the same multiple trigger keys in only a single component. CBA ensures that the backdoor is activated only when all trigger keys appear. Our experiments demonstrate that CBA is effective in both natural language processing (NLP) and multimodal tasks. For instance, with 3% poisoning samples against the LLaMA-7B model on the Emotion dataset, our attack achieves a 100% Attack Success Rate (ASR) with a False Triggered Rate (FTR) below 2.06% and negligible model accuracy degradation. Our work highlights the necessity of increased security research on the trustworthiness of foundation LLMs.

The Death and Life of Great Prompts: Analyzing the Evolution of LLM Prompts from the Structural Perspective
Yihan Ma | Xinyue Shen | Yixin Wu | Boyang Zhang | Michael Backes | Yang Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Effective utilization of large language models (LLMs), such as ChatGPT, relies on the quality of input prompts. This paper explores prompt engineering, specifically focusing on the disparity between experimentally designed prompts and real-world “in-the-wild” prompts. We analyze 10,538 in-the-wild prompts collected from various platforms and develop a framework that decomposes the prompts into eight key components. Our analysis shows that Role and Requirement are the two most prevalent components. Roles specified in the prompts, along with their capabilities, have become increasingly varied over time, signifying a broader range of application scenarios for LLMs. However, from the response of GPT-4, there is a marginal improvement with a specified role, whereas leveraging less prevalent components such as Capability and Demonstration can result in a more satisfying response. Overall, our work sheds light on the essential components of in-the-wild prompts and the effectiveness of these components on the broader landscape of LLM prompt engineering, providing valuable guidelines for the LLM community to optimize high-quality prompts.

ModSCAN: Measuring Stereotypical Bias in Large Vision-Language Models from Vision and Language Modalities
Yukun Jiang | Zheng Li | Xinyue Shen | Yugeng Liu | Michael Backes | Yang Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
