SubmissionNumber#=%=#33 FinalPaperTitle#=%=#Shortcut Learning in Safety: The Impact of Keyword Bias in Safeguards ShortPaperTitle#=%=# NumberOfPages#=%=#9 CopyrightSigned#=%=#Panuthep Tasawong JobTitle#==#PhD Student Organization#==#Vidyasirimedhi Institute of Science and Technology (VISTEC) 555 Moo 1, Pa Yup Nai, Wang Chan District, Rayong 21210, Thailand Abstract#==#This paper investigates the problem of shortcut learning in safety guardrails for large language models (LLMs). It reveals that current safeguard models often rely excessively on superficial cues, such as specific keywords that are spuriously correlated with training labels, rather than genuinely understanding the input's semantics or intent. As a result, their performance degrades significantly when there is a shift in keyword distribution. The paper also examines the impact of reducing shortcut reliance, showing that merely minimizing shortcut influence is insufficient. To build robust safeguard models, it is equally crucial to promote the use of intended features. 
Author{1}{Firstname}#=%=#Panuthep Author{1}{Lastname}#=%=#Tasawong Author{1}{Username}#=%=#panuthept Author{1}{Email}#=%=#panuthep.t_s20@vistec.ac.th Author{1}{Affiliation}#=%=#School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology, Thailand Author{2}{Firstname}#=%=#Napat Author{2}{Lastname}#=%=#Laosaengpha Author{2}{Email}#=%=#6470375421@student.chula.ac.th Author{2}{Affiliation}#=%=#Chulalongkorn University Author{3}{Firstname}#=%=#Wuttikorn Author{3}{Lastname}#=%=#Ponwitayarat Author{3}{Username}#=%=#wuttikornp.pro-2022_ Author{3}{Email}#=%=#wuttikornp_pro@vistec.ac.th Author{3}{Affiliation}#=%=#VISTEC Author{4}{Firstname}#=%=#Sitiporn Sae Author{4}{Lastname}#=%=#Lim Author{4}{Email}#=%=#Sitiporn.s@vistec.ac.th Author{4}{Affiliation}#=%=#School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology, Thailand Author{5}{Firstname}#=%=#Potsawee Author{5}{Lastname}#=%=#Manakul Author{5}{Username}#=%=#potsawee Author{5}{Email}#=%=#pm574@cam.ac.uk Author{5}{Affiliation}#=%=#University of Cambridge Author{6}{Firstname}#=%=#Samuel Author{6}{Lastname}#=%=#Cahyawijaya Author{6}{Username}#=%=#samuelc Author{6}{Email}#=%=#samuelcahyawijaya@cohere.com Author{6}{Affiliation}#=%=#Cohere Author{7}{Firstname}#=%=#Can Author{7}{Lastname}#=%=#Udomcharoenchaikit Author{7}{Email}#=%=#canu_pro@vistec.ac.th Author{7}{Affiliation}#=%=#School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology, Thailand Author{8}{Firstname}#=%=#Peerat Author{8}{Lastname}#=%=#Limkonchotiwat Author{8}{Email}#=%=#peerat@aisingapore.org Author{8}{Affiliation}#=%=#AI Singapore Author{9}{Firstname}#=%=#Ekapol Author{9}{Lastname}#=%=#Chuangsuwanich Author{9}{Username}#=%=#ekapolc Author{9}{Email}#=%=#ekapolc@cp.eng.chula.ac.th Author{9}{Affiliation}#=%=#Chulalongkorn University Author{10}{Firstname}#=%=#Sarana Author{10}{Lastname}#=%=#Nutanong Author{10}{Username}#=%=#snutanong 
Author{10}{Email}#=%=#snutanon@vistec.ac.th Author{10}{Affiliation}#=%=#School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology, Thailand ==========