Min Cai

2026

Model editing-based jailbreak backdoor attacks against LLMs have gained attention for being lightweight, enabling vulnerability discovery in LLMs. Existing methods bind backdoors to predefined phrases that serve as the first few output tokens, inducing the LLM's next-token prediction to continue the response. However, their effectiveness depends heavily on the number of bound phrases, and attack costs rise as this number increases. In this work, we propose JEST, which achieves jailbreak backdoor attacks by hijacking LLM representations into an acceptance domain rather than binding to a few output tokens. Specifically, we propose a representation transition-guided model editing method to inject jailbreak backdoors into LLMs. The activated backdoor transitions the LLM from the rejection domain to the acceptance domain, causing it to accept and generate jailbreak behavior. To clearly distinguish between the rejection and acceptance domains within LLMs, we also design a domain modeling strategy for JEST that models these two opposing domains within the representation space. Additionally, JEST-hijacked LLMs exhibit greater vulnerability to direct prompt attacks. Experimental results show that JEST outperforms existing model editing methods, demonstrating stronger jailbreak capabilities across various LLMs and datasets. We also provide analysis to explore the safety boundaries of LLMs.

This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Models (LLMs) in data science. Unlike existing benchmarks, which are limited to single tasks, simple evaluation metrics, and readily available ground truth (GT), DataSciBench is built on curated, natural, and challenging prompts with complex evaluation criteria and uncertain GT. To bridge the gap, we develop a semi-automated GT generation pipeline, integrating LLM-based self-consistency and human verification to ensure accuracy, together with predefined task types and aggregate functions (metrics). Furthermore, we introduce an Intention-Function-Code (IFC) framework, which assesses code execution outcomes through metrics and programmatic rules. Evaluating 26 models (8 API-based, 8 open-source general, 9 code generation, and 1 agentic model), our approach offers rigorous insights into LLM strengths and weaknesses. Experimental results show that API-based models outperform open-source counterparts across all metrics, with DeepAnalyze-8B leading among open-source models. We release all code and data at https://github.com/THUDM/DataSciBench.

Venues

Findings (2)
