Xingjun Wang

2026

HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment
Yuexiao Liu | Lijun Li | Xingjun Wang | Jing Shao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have gained significant attention due to their objective and verifiable reward signals, demonstrating strong performance in reasoning and code generation tasks. However, the potential safety risks associated with RLVR remain underexplored. This paper presents HarmRLVR, the first systematic investigation into the alignment reversibility risk of RLVR. We show that safety alignment can be rapidly reversed using GRPO with merely 64 harmful prompts without responses, causing models to readily comply with harmful instructions. Across five models from Llama, Qwen, and DeepSeek, we empirically demonstrate that RLVR-based attacks elevate the average harmfulness score to 4.94 with an attack success rate of 96.01%, significantly outperforming harmful fine-tuning while preserving general capabilities. Our findings reveal that RLVR can be efficiently exploited for harmful alignment, posing serious threats to open-source model safety.

2022

HIE-SQL: History Information Enhanced Network for Context-Dependent Text-to-SQL Semantic Parsing
Yanzhao Zheng | Haibin Wang | Baohua Dong | Xingjun Wang | Changshan Li
Findings of the Association for Computational Linguistics: ACL 2022

Recently, context-dependent text-to-SQL semantic parsing, which translates natural language into SQL in an interaction process, has attracted a lot of attention. Previous works leverage context dependence information either from interaction history utterances or from previously predicted queries, but fail to take advantage of both because of the mismatch between natural language and logic-form SQL. In this work, we propose a History Information Enhanced text-to-SQL model (HIE-SQL) to exploit context dependence information from both history utterances and the last predicted SQL query. In view of the mismatch, we treat natural language and SQL as two modalities and propose a bimodal pre-trained model to bridge the gap between them. Besides, we design a schema-linking graph to enhance connections from utterances and the SQL query to the database schema. We show that our history-information-enhanced methods improve the performance of HIE-SQL by a significant margin, achieving new state-of-the-art results on two context-dependent text-to-SQL benchmarks, the SparC and CoSQL datasets, at the time of writing.

Co-authors

Haibin Wang 1

Yanzhao Zheng 1

Venues

ACL 1
Findings 1
