Yichi Zhang
2025
Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space
Yao Huang | Yitong Sun | Shouwei Ruan | Yichi Zhang | Yinpeng Dong | Xingxing Wei
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs), despite their advanced general capabilities, still suffer from numerous safety risks, especially jailbreak attacks that bypass safety protocols. Understanding these vulnerabilities through black-box jailbreak attacks, which better reflect real-world scenarios, offers critical insights into model robustness. While existing methods have shown improvements through various prompt engineering techniques, their success remains limited against safety-aligned models, overlooking a more fundamental problem: their effectiveness is inherently bounded by the predefined strategy space. Expanding this space, however, presents significant challenges in both systematically capturing essential attack patterns and efficiently navigating the increased complexity. To better explore the potential of an expanded strategy space, we address these challenges through a novel framework that decomposes jailbreak strategies into essential components based on the Elaboration Likelihood Model (ELM) and develops a genetic-based optimization with an intention evaluation mechanism. Strikingly, our experiments reveal unprecedented jailbreak capabilities from expanding the strategy space: we achieve over a 90% success rate on Claude-3.5, where prior methods fail completely, while demonstrating strong cross-model transferability and surpassing specialized safeguard models in evaluation accuracy. The code is open-sourced at: https://github.com/Aries-iai/CL-GSO.
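For readers unfamiliar with genetic-style search over a decomposed strategy space, the sketch below illustrates the general recipe of evolving combinations of strategy components under a fitness signal. It is a minimal illustration under stated assumptions: the component slots, their values, and the caller-supplied score function are hypothetical stand-ins, not the actual CL-GSO implementation or its intention-evaluation judge.

import random

COMPONENTS = {                      # hypothetical ELM-style strategy slots
    "role":    ["researcher", "novelist", "auditor"],
    "framing": ["hypothetical", "historical", "technical"],
    "appeal":  ["central-route", "peripheral-route"],
}

def random_strategy():
    # sample one value per component slot
    return {k: random.choice(v) for k, v in COMPONENTS.items()}

def crossover(a, b):
    # inherit each slot from either parent
    return {k: random.choice([a[k], b[k]]) for k in COMPONENTS}

def mutate(s, rate=0.2):
    # occasionally resample a slot
    return {k: (random.choice(v) if random.random() < rate else s[k])
            for k, v in COMPONENTS.items()}

def evolve(score, generations=20, pop_size=12):
    """score: callable mapping a strategy dict to a fitness value;
    in the paper's setting this role is played by an evaluation judge,
    but here it is simply supplied by the caller."""
    pop = [random_strategy() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=score)

# Toy usage with a dummy fitness function:
best = evolve(lambda s: random.random())
print(best)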
Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization
Siyuan Zhang | Yichi Zhang | Yinpeng Dong | Hang Su
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Models (LLMs) often struggle to align their responses with objective facts, resulting in factual hallucinations that can be difficult to detect and that mislead users who lack the relevant knowledge. Although post-training techniques have been employed to mitigate the issue, existing methods usually suffer from poor generalization and trade-offs with other capabilities. In this paper, we propose to address these issues by directly augmenting the LLM's fundamental ability to precisely leverage its knowledge, and we introduce PKUE (Precise Knowledge Utilization Enhancement), which fine-tunes the model on its self-generated responses to precise and simple factual questions through preference optimization. Furthermore, we construct FactualBench, a comprehensive and precise factual QA dataset containing 181k Chinese entries spanning 21 domains, to facilitate both evaluation and training. Extensive experiments demonstrate that PKUE significantly improves the overall performance of LLMs, with consistent gains across factual tasks of various forms, general tasks beyond factuality, and tasks in other languages.
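As a rough illustration of the preference-optimization step the abstract mentions, the sketch below computes a standard DPO-style loss over chosen versus rejected self-generated answers. It assumes PyTorch and uses generic names (dpo_loss, beta); it is not the actual PKUE training code, only the general form of optimizing a policy against a frozen reference model on preference pairs.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Each argument is a tensor of per-example summed log-probabilities
    under the policy or the frozen reference model."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs:
lp = lambda: torch.randn(4)
print(dpo_loss(lp(), lp(), lp(), lp()))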
Co-authors
- Yinpeng Dong 2
- Yao Huang 1
- Shouwei Ruan 1
- Hang Su 1
- Yitong Sun 1