Mi Zhang

Dublin

2026

Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. Current defense methods, however, depend on costly fine-tuning and additional expert knowledge, which limits their scalability. In this work, we propose ***ReasoningGuard***, an inference-time safeguard for LRMs. It injects timely *safety aha moments* during the reasoning process to guide the model towards harmless yet helpful reasoning. Our approach leverages the internal attention mechanisms of the LRM to accurately identify key points in the reasoning path, triggering safety-oriented reflections. To safeguard both the subsequent reasoning steps and the final answers, we implement a scaling sampling strategy during decoding to select the optimal reasoning path. With minimal additional inference cost, *ReasoningGuard* effectively mitigates four types of jailbreak attacks, including recent ones targeting the reasoning process of LRMs. Our approach outperforms nine existing safeguards, providing state-of-the-art defenses while avoiding common exaggerated safety issues.

2023

SlowBERT: Slow-down Attacks on Input-adaptive Multi-exit BERT
Shengyao Zhang | Xudong Pan | Mi Zhang | Min Yang
Findings of the Association for Computational Linguistics: ACL 2023

For pretrained language models such as Google’s BERT, recent research designs several input-adaptive inference mechanisms to improve the efficiency on cloud and edge devices. In this paper, we reveal a new attack surface on input-adaptive multi-exit BERT, where the adversary imperceptibly modifies the input texts to drastically increase the average inference cost. Our proposed slow-down attack called SlowBERT integrates a new rank-and-substitute adversarial text generation algorithm to efficiently search for the perturbation which maximally delays the exiting time. With no direct access to the model internals, we further devise a time-based approximation algorithm to infer the exit position as the loss oracle. Our extensive evaluation on two popular instances of multi-exit BERT for GLUE classification tasks validates the effectiveness of SlowBERT. In the worst case, SlowBERT increases the inference cost by 4.57×, which would strongly hurt the service quality of multi-exit BERT in practice, e.g., increasing the real-time cloud services’ response times for online users.

Co-authors

Min Yang 1

Xiaoyu You 1

Shengyao Zhang 1

Venues

ACL 1
Findings 1