Yuxiao Luo

2026

Thinking Twice Makes Large Language Models Safer and More Helpful
Yutao Mou | Yuxiao Luo | Shikun Zhang | Wei Ye
Findings of the Association for Computational Linguistics: ACL 2026

Current safety alignment techniques for large language models (LLMs) struggle to balance harmlessness and helpfulness: improving safety often comes at the cost of degraded utility. Our preliminary study shows that guiding unaligned base models with safety-aware reasoning that includes explicit self-reflection can effectively defend against jailbreak attacks while preserving response quality. This observation motivates internalizing and strengthening self-reflective reasoning capabilities within LLMs to achieve a better safety–utility trade-off. We propose Safety-aware Reflective Reasoning Optimization (SaRO), a two-stage framework: (1) Reasoning-style Warmup (RW) to internalize self-reflective reasoning, and (2) Self-reflective Reasoning Process Optimization (SRPO) to encourage reflection and correction. Experiments show that SaRO outperforms existing reasoning-based alignment methods, achieving a better balance of safety and helpfulness.
