XiaoFeng Wang
Other people with similar names: Xiaofeng Wang, Xiaofeng Wang
Unverified author pages with similar names: Xiaofeng Wang
2026
ReasMark: A Robust Watermark for Attributing LLM Reasoning Under Knowledge Distillation Attacks
Peizhuo Lv | Ruihua Zhou | Yunpeng Li | Ruigang Liang | Xingshuo Han | XiaoFeng Wang | Wei Dong | Yuling Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Peizhuo Lv | Ruihua Zhou | Yunpeng Li | Ruigang Liang | Xingshuo Han | XiaoFeng Wang | Wei Dong | Yuling Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reasoning-enhanced large language models rely on intermediate reasoning signals to solve complex, multi-step tasks, making reasoning behavior a valuable form of intellectual property. Meanwhile, knowledge distillation enables an adversary to replicate this behavior in a realistic black-box setting by repeatedly querying a deployed model on a target domain and training a local student to imitate its outputs, including reasoning traces. Existing LLM watermarks primarily operate on surface text and decoding-time token biases, and thus fail to provide reliable attribution of reasoning behavior once it is transferred through knowledge distillation. ReasMark entangles the watermark with the target-domain input distribution by selecting watermark tokens from high-frequency prompts, so distillation queries naturally activate it. It then embeds the watermark by score-conditioned losses that create a detectable reasoning-length gap for black-box verification. Comprehensive experiments across multiple LLMs, datasets, and distillation settings demonstrate that ReasMark consistently outperforms existing baselines while preserving task utility.
Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy
Eric Hanchen Jiang | Weixuan Ou | Run Liu | Shengyuan Pang | Guancheng Wan | Ranjie Duan | Wei Dong | Kai-Wei Chang | XiaoFeng Wang | Ying Nian Wu | Xinfeng Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Eric Hanchen Jiang | Weixuan Ou | Run Liu | Shengyuan Pang | Guancheng Wan | Ranjie Duan | Wei Dong | Kai-Wei Chang | XiaoFeng Wang | Ying Nian Wu | Xinfeng Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Safety alignment of large language models currently faces a central challenge: existing alignment techniques often prioritize mitigating responses to harmful prompts at the expense of overcautious behavior, leading models to incorrectly refuse benign requests. A key goal of safe alignment is therefore to improve safety while simultaneously minimizing false refusals. In this work, we introduce Energy Landscape Steering (ELS), a novel, fine-tuning free framework designed to resolve this challenge through dynamic, inference-time intervention. We trained a lightweight, external Energy-Based Model (EBM) to assign high energy to undesirable (false refusal or jailbreak) states and low energy to desirable (helpful response or safe reject) ones. During inference, the EBM maps the LLM’s internal activations to an energy landscape, and we use the gradient of the energy function to steer the hidden states toward low-energy regions in real time. This dynamically guides the model toward desirable behavior without modifying its parameters. By decoupling behavioral control from the model’s core knowledge, ELS provides a flexible and computationally efficient solution. Extensive experiments across diverse models demonstrate its effectiveness: raising compliance on the ORB-H benchmark from 57.3% to 82.6% while maintaining the baseline safety performance. Our work establishes a promising paradigm for building LLMs that simultaneously achieve high safety and low false refusal rates.