Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
Yue Huang, Haomin Zhuang, Jiayi Ye, Han Bao, Yanbo Wang, Hang Hua, Siyuan Wu, Pin-Yu Chen, Xiangliang Zhang
Abstract
Hard-gated safety checkers often over-refuse and misalign with a vendor’s model spec; prevailing taxonomies also neglect robustness and honesty, yielding systems that are safer on paper yet less useful. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline in which a guardian predicts a binary risk label plus a concise explanation and prepends this advice to the original query for re-inference, keeping the base model operating under its original spec. To support training and evaluation, GuardSet is constructed: a 208k+ multi-domain dataset unifying harmful and harmless cases with targeted robustness and honesty slices. GuardAdvisor is trained via SFT followed by RL to enforce label–explanation consistency. GuardAdvisor attains competitive detection accuracy while enabling the advisory workflow; when its advice is used to augment inputs, responses improve over those for unaugmented prompts. A latency study shows that advisor inference uses below 5% of base-model compute and adds only 2–10% end-to-end overhead under realistic harmful-input rates. Overall, GaaA steers models to comply with the model spec, maintaining safety while reducing over-refusal.
- Anthology ID:
- 2026.findings-acl.292
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5878–5900
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.292/
- Cite (ACL):
- Yue Huang, Haomin Zhuang, Jiayi Ye, Han Bao, Yanbo Wang, Hang Hua, Siyuan Wu, Pin-Yu Chen, and Xiangliang Zhang. 2026. Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs. In Findings of the Association for Computational Linguistics: ACL 2026, pages 5878–5900, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs (Huang et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.292.pdf
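The soft-gating flow described in the abstract (a guardian emits a binary risk label and a short explanation, and that advice is prepended to the original query before the base model answers) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `Advice` type, the advice template in `augment_query`, the keyword-based `toy_guardian`, and the `echo_base_model` stand-in are assumptions for illustration only. The sketch also augments every query unconditionally, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Advice:
    """Hypothetical container for the guardian's output."""
    risky: bool        # binary risk label
    explanation: str   # concise rationale for the label


def augment_query(query: str, advice: Advice) -> str:
    """Prepend the advice to the original query (template is an assumption)."""
    label = "harmful" if advice.risky else "harmless"
    header = f"[Guardian advice] label: {label}; reason: {advice.explanation}"
    return f"{header}\n\n{query}"


def answer_with_advice(
    query: str,
    guardian: Callable[[str], Advice],
    base_model: Callable[[str], str],
) -> str:
    """Soft gating: the guardian only advises; the base model still answers."""
    advice = guardian(query)
    return base_model(augment_query(query, advice))


def toy_guardian(query: str) -> Advice:
    """Keyword stand-in for a trained guardian model (illustration only)."""
    if "bypass" in query.lower():
        return Advice(True, "request may seek to circumvent access controls")
    return Advice(False, "no risk indicators found")


def echo_base_model(prompt: str) -> str:
    """Stand-in for the base LLM; it simply returns the prompt it received."""
    return prompt


if __name__ == "__main__":
    print(answer_with_advice("How do I bypass a login page?",
                             toy_guardian, echo_base_model))
```

Unlike a hard gate, nothing in this flow blocks the request: the decision to comply or refuse stays with the base model, which reads the advice as extra context.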