Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
Yavuz Faruk Bakman, Duygu Nur Yaldiz, Salman Avestimehr, Sai Praneeth Karimireddy
Abstract
Large Language Models (LLMs) are rarely static and are frequently updated in practice. A growing body of alignment research has shown that models initially deemed “aligned” can exhibit misaligned behavior after fine-tuning, such as forgetting jailbreak safety features or re-surfacing knowledge that was intended to be forgotten. These works typically assume that the initial model is aligned based on static black-box evaluation, i.e., the absence of undesired responses to a fixed set of queries. In contrast, we formalize model alignment in both the static and post-update settings and uncover a fundamental limitation of black-box evaluation. We theoretically show that, due to overparameterization, static alignment provides no guarantee of post-update alignment for any update dataset. Moreover, we prove that static black-box probing cannot distinguish a model that is genuinely post-update robust from one that conceals an arbitrary amount of adversarial behavior, which can be activated by even a single benign gradient update. We further validate these findings empirically in LLMs across three core alignment domains: privacy, jailbreak safety, and behavioral honesty. We demonstrate the existence of LLMs that pass all standard black-box alignment tests, yet become severely misaligned after a single benign update. Finally, we show that the capacity to hide such latent adversarial behavior increases with model scale, confirming our theoretical prediction that post-update misalignment grows with the number of parameters. Together, our results highlight the inadequacy of static evaluation protocols and emphasize the urgent need for post-update–robust alignment evaluation.
- Anthology ID:
- 2026.trustnlp-main.10
- Volume:
- Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California
- Editors:
- Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galystan, Anoop Kumar, Rahul Gupta
- Venues:
- TrustNLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 180–203
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.10/
- Cite (ACL):
- Yavuz Faruk Bakman, Duygu Nur Yaldiz, Salman Avestimehr, and Sai Praneeth Karimireddy. 2026. Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 180–203, San Diego, California. Association for Computational Linguistics.
- Cite (Informal):
- Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment (Bakman et al., TrustNLP 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.10.pdf