Zhuohan Long
2025
Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
Siyuan Wang | Zhuohan Long | Zhihao Fan | Xuanjing Huang | Zhongyu Wei
Proceedings of the 31st International Conference on Computational Linguistics
This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs). We utilize a multi-agent system to generate, with high confidence, new evolving instances that extend existing benchmarks. Towards a more scalable, robust and fine-grained evaluation, we implement six reframing operations to construct evolving instances that test LLMs against diverse queries and shortcut biases, and probe their problem-solving sub-abilities. With this framework, we extend datasets across general and specific tasks through various iterations. Experimental results show a performance decline for most LLMs relative to their original results under scalable and robust evaluations, offering a more accurate reflection of model capabilities alongside our fine-grained evaluation. Moreover, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks. We hope this framework contributes to the research community's efforts to continuously evolve benchmarks alongside LLM development.
How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation
Zhuohan Long | Siyuan Wang | Shujun Liu | Yuhang Lai
Findings of the Association for Computational Linguistics: EMNLP 2025
Jailbreak attacks, where harmful prompts bypass generative models’ built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem to assess model refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model’s ability to differentiate between harmful and benign inputs. Using these mechanisms, we develop two ensemble defense strategies—inter-mechanism and intra-mechanism ensembles—to balance safety and helpfulness. Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these strategies effectively improve model safety or optimize the trade-off between safety and helpfulness.
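As a rough illustration of the refusal-tendency framing described in this abstract (this is not code from the paper; the function names and toy data below are hypothetical), the two mechanisms can be read off refusal rates measured separately on harmful and benign queries: a safety shift raises both rates, while harmfulness discrimination widens the gap between them.

```python
# Hypothetical sketch of the two defense mechanisms, assuming each query has
# been labeled with a binary refusal decision (1 = model refused, 0 = complied).

def refusal_rate(refusals):
    """Fraction of queries the model refused."""
    return sum(refusals) / len(refusals) if refusals else 0.0

def defense_effects(base_harmful, base_benign, def_harmful, def_benign):
    """Compare a defended model against its undefended baseline.

    - A "safety shift" appears as refusal rates rising on both harmful
      and benign queries.
    - "Harmfulness discrimination" appears as a wider gap between the
      harmful and benign refusal rates.
    """
    shift = (refusal_rate(def_harmful) - refusal_rate(base_harmful)
             + refusal_rate(def_benign) - refusal_rate(base_benign)) / 2
    base_gap = refusal_rate(base_harmful) - refusal_rate(base_benign)
    def_gap = refusal_rate(def_harmful) - refusal_rate(def_benign)
    return {"safety_shift": shift, "discrimination_gain": def_gap - base_gap}

# Toy example: a defense that mostly discriminates rather than shifts.
print(defense_effects(
    base_harmful=[1, 0, 1, 0], base_benign=[0, 0, 1, 0],
    def_harmful=[1, 1, 1, 1],  def_benign=[0, 0, 1, 0],
))
```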
2024
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
Siyuan Wang | Zhuohan Long | Zhihao Fan | Zhongyu Wei
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The rapid development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has exposed vulnerabilities to various adversarial attacks. This paper provides a comprehensive overview of jailbreaking research targeting both LLMs and MLLMs, highlighting recent advancements in evaluation benchmarks, attack techniques and defense strategies. Compared to the more advanced state of unimodal jailbreaking, the multimodal domain remains underexplored. We summarize the limitations and potential research directions of multimodal jailbreaking, aiming to inspire future research and further enhance the robustness and security of MLLMs.
Co-authors
- Siyuan Wang (王思远) 3
- Zhihao Fan 2
- Zhongyu Wei (魏忠钰) 2
- Xuan-Jing Huang (黄萱菁) 1
- Yuhang Lai 1