Thinking Twice Makes Large Language Models Safer and More Helpful

Yutao Mou; Yuxiao Luo; Shikun Zhang; Wei Ye

Abstract

Current safety alignment techniques for large language models (LLMs) struggle to balance harmlessness and helpfulness: improving safety often comes at the cost of degraded utility. Our preliminary study shows that guiding unaligned base models with safety-aware reasoning that includes explicit self-reflection can effectively defend against jailbreak attacks while preserving response quality. This observation motivates internalizing and strengthening self-reflective reasoning capabilities within LLMs to achieve a better safety–utility trade-off. We propose Safety-aware Reflective Reasoning Optimization (SaRO), a two-stage framework: (1) Reasoning-style Warmup (RW) to internalize self-reflective reasoning, and (2) Self-reflective Reasoning Process Optimization (SRPO) to encourage reflection and correction. Experiments show that SaRO outperforms existing reasoning-based alignment methods, achieving a better balance of safety and helpfulness.

Anthology ID: 2026.findings-acl.1812
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 36365–36389
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1812/
Cite (ACL): Yutao Mou, Yuxiao Luo, Shikun Zhang, and Wei Ye. 2026. Thinking Twice Makes Large Language Models Safer and More Helpful. In Findings of the Association for Computational Linguistics: ACL 2026, pages 36365–36389, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Thinking Twice Makes Large Language Models Safer and More Helpful (Mou et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1812.pdf
Checklist: 2026.findings-acl.1812.checklist.pdf
