Zhao Tong

2026

While prompt engineering enhances the capabilities of Large Language Models (LLMs), it also exposes critical safety concerns. Due to the inherent brittleness of their static safety boundaries, LLMs are vulnerable to jailbreak prompts, i.e. adversarial inputs designed to bypass safeguards and induce the generation of harmful content. Existing detection mechanisms rely on static model components or fixed decision thresholds, limiting their ability to generalize to evolving attack patterns and continual model updates. To bridge this gap, we propose RLShield, a dynamic jailbreak detection framework that employs reinforcement learning for adaptive threshold selection. RLShield incorporates three key innovations: (i) a dynamic retrieval and LLM-based rewriting module to simulate diverse adversarial contexts; (ii) a cross-layer representation analysis to pinpoint safety-critical parameters; and (iii) a Soft Actor-Critic (SAC) based agent that learns to predict optimal, sample-specific detection thresholds. Experimental results demonstrate that RLShield consistently outperforms state-of-the-art baselines in detection performance while maintaining high computational efficiency. Notably, it improves F1 by up to 7.3%, while achieving an average of 3× gain in inference efficiency across multiple LLM backbones.

2025

pdf bib abs

The spread of fake news on online platforms has long been a pressing concern. Considering this, extensive efforts have been made to develop fake news detectors. However, a major drawback of these models is their relatively low performance—lagging by more than 20%—in identifying fake news compared to real news, making them less suitable for practical deployment. This gap is likely due to an imbalance in the dataset and the model’s inadequate understanding of data distribution on the targeted platform. In this work, we focus on improving the model’s effectiveness in detecting fake news. To achieve this, we first adopt an LLM to generate fake news in three different styles, which are later incorporated into the training set to augment the representation of fake news. Then, we apply Reinforcement Learning to dynamically sample fake news, allowing the model to learn the optimal real-to-fake news ratio for training an effective fake news detector on the targeted platform. This approach allows our model to perform effectively even with a limited amount of annotated news data and consistently improve detection accuracy across different platforms. Experimental results demonstrate that our approach achieves state-of-the-art performance on two benchmark datasets, improving fake news detection performance by 24.02% and 11.06% respectively.

Co-authors

Huidong Liu 1

Xingcheng Xu 1

Pengfei Yang 1

Venues

ACL1
Findings1

Fix author