Domain Generalizable AI Guardrails with Augmented Policy Training

Minqian Liu; Ioana Baldini; David Rabinowitz; David S Rosenberg; Sebastian Gehrmann; Mark Dredze

Abstract

AI guardrail systems support usage policies by determining whether a user query or a generated response is allowed or forbidden under the policy. Fine-tuned guardrails – such as LlamaGuard and ShieldGemma – include policy definitions in prompts during training that can be updated during inference to aid generalization. However, our analysis reveals that these models still overfit the training policies, which prevents adaptation to new domains. We propose Augmented Policy Training (APT), a training recipe that enhances guardrail adaptability to unseen policies by using a suite of policy perturbation strategies during training to reduce overfitting and increase generalization. Notably, a small 1B model trained in this manner achieves comparable or better performance than existing 8B guardrails on unseen policies. Our work reveals critical limitations of existing AI guardrails, offers a promising solution, and provides actionable insights for adapting systems to new domains and policies.

Anthology ID: 2026.acl-long.748
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 16452–16469
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.748/
Cite (ACL): Minqian Liu, Ioana Baldini, David Rabinowitz, David S Rosenberg, Sebastian Gehrmann, and Mark Dredze. 2026. Domain Generalizable AI Guardrails with Augmented Policy Training. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16452–16469, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Domain Generalizable AI Guardrails with Augmented Policy Training (Liu et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.748.pdf
Checklist: 2026.acl-long.748.checklist.pdf
