Ziqing Yang
2026
Peering Behind the Shield: Guardrail Identification in Large Language Models
Ziqing Yang | Yixin Wu | Rui Wen | Michael Backes | Yang Zhang
Findings of the Association for Computational Linguistics: ACL 2026
With the rapid adoption of large language models (LLMs), conversational AI agents have become widely deployed across real-world applications. To enhance safety, these agents are often equipped with guardrails that moderate harmful content. Identifying the guardrails in an agent thus becomes critical for adversaries to understand the system and design guard-specific attacks. In this work, we introduce AP-Test, a novel approach that leverages guard-specific adversarial prompts to detect the identity of guardrails deployed in black-box AI agents. Our method addresses key challenges in this task, including the influence of safety-aligned LLMs and other guardrails, as well as a lack of principled decision-making strategies. AP-Test employs two complementary testing strategies, input and output guard tests, and a new metric, match score, to enable robust identification. Experiments across diverse agents and four open-source guardrails demonstrate that AP-Test achieves perfect classification accuracy in multiple scenarios. Ablation studies further highlight the necessity of our proposed components. Our findings reveal a practical path toward guardrail identification in real-world AI systems.
PeerCheck: Enhancing LLM-Generated Academic Reviews Towards Human-Level Quality
Zeyuan Chen | Ziqing Yang | Yihan Ma | Michael Backes | Yang Zhang
Findings of the Association for Computational Linguistics: ACL 2026
As academic submissions grow, the traditional peer review process struggles to keep up, raising concerns about quality and fairness. A trend of using large language models (LLMs) for assistance has emerged. In this work, we take a critical step toward improving the quality of LLM-generated reviews. We propose the PeerCheck framework, which investigates LLM-human review differences (RQ1) and explores methods to increase LLM-human similarity (RQ2). We first compare human-written reviews with reviews generated by GPT-4o, Claude-3.7-Sonnet, and DeepSeek-V3 and find that LLMs and humans focus on different terms, e.g., LLMs prioritize theory while humans emphasize methodology and experiments. We further adopt prompt engineering, such as Chain-of-Thought (CoT), and utilize retrieval-augmented generation (RAG) to enhance the LLM-generated reviews towards human-level quality. We find that CoT significantly improves the human similarity of LLM reviews, while we also discover an unexpected “RAG paradox,” i.e., experiments with RAG produce different results for various LLMs and, in some cases, even reduce review quality. Our comprehensive analysis of LLM-generated academic reviews illustrates both possibilities and limitations, contributing to a more effective, human-aligned review system.
2025
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
Junjie Chu | Yugeng Liu | Ziqing Yang | Xinyue Shen | Michael Backes | Yang Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jailbreak attacks aim to bypass LLMs’ safeguards. While researchers have studied different jailbreak attacks in depth, they have done so in isolation, either with unaligned settings or comparing a limited range of methods. To fill this gap, we present a large-scale evaluation of various jailbreak attacks. We collect 17 representative jailbreak attacks, summarize their features, and establish a novel jailbreak attack taxonomy. Then we conduct comprehensive measurement and ablation studies across nine aligned LLMs on 160 forbidden questions from 16 violation categories. We also test jailbreak attacks under eight advanced defenses. Based on our taxonomy and experiments, we identify some important patterns; for example, heuristic-based attacks can achieve high attack success rates but are easily mitigated by defenses. Our study offers valuable insights for future research on jailbreak attacks and defenses and serves as a benchmark tool for researchers and practitioners to evaluate them effectively.