Guardrails and Security for LLMs: Safe, Secure and Controllable Steering of LLM Applications

Traian Rebedea, Leon Derczynski, Shaona Ghosh, Makesh Narsimhan Sreedhar, Faeze Brahman, Liwei Jiang, Bo Li, Yulia Tsvetkov, Christopher Parisien, Yejin Choi


Abstract
Pretrained generative models, especially large language models, provide novel ways for users to interact with computers. While generative NLP research and applications had previously aimed at very domain-specific or task-specific solutions, current LLMs and applications (e.g., dialogue systems, agents) are versatile across many tasks and domains. Despite being trained to be helpful and aligned with human preferences (e.g., harmlessness), enforcing robust guardrails on LLMs remains a challenge. Moreover, even when protected against rudimentary attacks, LLMs, like other complex software, can be vulnerable to sophisticated adversarial inputs. This tutorial provides a comprehensive overview of key guardrail mechanisms developed for LLMs, along with evaluation methodologies and a detailed security assessment protocol, including auto red-teaming of LLM-powered applications. Our aim is to move beyond the discussion of single-prompt attacks and evaluation frameworks towards addressing how guardrailing can be done in complex dialogue systems that employ LLMs.
Anthology ID:
2025.acl-tutorials.8
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Yuki Arase, David Jurgens, Fei Xia
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
13–15
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-tutorials.8/
Cite (ACL):
Traian Rebedea, Leon Derczynski, Shaona Ghosh, Makesh Narsimhan Sreedhar, Faeze Brahman, Liwei Jiang, Bo Li, Yulia Tsvetkov, Christopher Parisien, and Yejin Choi. 2025. Guardrails and Security for LLMs: Safe, Secure and Controllable Steering of LLM Applications. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts), pages 13–15, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Guardrails and Security for LLMs: Safe, Secure and Controllable Steering of LLM Applications (Rebedea et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-tutorials.8.pdf