Resolving the Security-Auditability Dilemma with Auditable Latent Chain-of-Thought Alignment

Guan Wang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu


Abstract
To address the increasingly severe safety risks of large language models (LLMs), reasoning-based safety alignment methods have emerged. These methods overcome the limitations of ‘shallow alignment’ by exposing the model’s Chain-of-Thought (CoT), making the safety reasoning process auditable through both training-phase supervision and post-generation verification. However, this transparency creates a critical vulnerability, a tension we define as the Security-Auditability Dilemma: while explicit reasoning is a prerequisite for safety, its textual auditability paradoxically transforms it into an optimization target for adaptive attackers and induces the model to unintentionally copy harmful content from its own reasoning context. To address this, we propose Auditable Latent CoT Alignment (ALCA), a framework that decouples internal reasoning from external output. ALCA shifts the safety deliberation process into a continuous latent space. This allows the safety reasoning process to guide the generation of harmless outputs while eliminating the discrete textual surface that facilitates internal copying and adaptive attacks. Yet this process is not a black box: we introduce a restricted Self-Decoding mechanism that allows the model to reconstruct its latent reasoning into human-readable text for supervision under specific guidance. Extensive experiments show that ALCA achieves robust alignment, reducing the success rate of adaptive jailbreak attacks by over 40% compared to strong baselines, while preserving performance. Our framework presents a path toward building LLMs that are both robustly secure and auditable.
Anthology ID:
2026.acl-long.1570
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
34051–34067
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1570/
Cite (ACL):
Guan Wang, Biyu Zhou, Xuehai Tang, Jizhong Han, and Songlin Hu. 2026. Resolving the Security-Auditability Dilemma with Auditable Latent Chain-of-Thought Alignment. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 34051–34067, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Resolving the Security-Auditability Dilemma with Auditable Latent Chain-of-Thought Alignment (Wang et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1570.pdf
Checklist:
2026.acl-long.1570.checklist.pdf