Localizing Malicious Outputs from CodeLLM

Mayukh Borana; Junyi Liang; Sai Sathiesh Rajan; Sudipta Chattopadhyay

doi:10.18653/v1/2025.findings-emnlp.1041

Localizing Malicious Outputs from CodeLLM

Mayukh Borana, Junyi Liang, Sai Sathiesh Rajan, Sudipta Chattopadhyay

Abstract

We introduce FreqRank, a mutation-based defense to localize malicious components in LLM outputs and their corresponding backdoor triggers. FreqRank assumes that the malicious sub-string(s) consistently appear in outputs for triggered inputs and uses a frequency-based ranking system to identify them. Our ranking system then leverages this knowledge to localize the backdoor triggers present in the inputs. We create nine malicious models through fine-tuning or custom instructions for three downstream tasks, namely, code completion (CC), code generation (CG), and code summarization (CS), and show that they have an average attack success rate (ASR) of 86.6%. Furthermore, FreqRank’s ranking system highlights the malicious outputs as one of the top five suggestions in 98% of cases. We also demonstrate that FreqRank’s effectiveness scales as the number of mutants increases and show that FreqRank is capable of localizing the backdoor trigger effectively even with a limited number of triggered samples. Finally, we show that our approach is 35-50% more effective than other defense methods.

Anthology ID:: 2025.findings-emnlp.1041
Volume:: Findings of the Association for Computational Linguistics: EMNLP 2025
Month:: November
Year:: 2025
Address:: Suzhou, China
Editors:: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 19132–19143
Language:
URL:: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1041/
DOI:: 10.18653/v1/2025.findings-emnlp.1041
Bibkey:
Cite (ACL):: Mayukh Borana, Junyi Liang, Sai Sathiesh Rajan, and Sudipta Chattopadhyay. 2025. Localizing Malicious Outputs from CodeLLM. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 19132–19143, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):: Localizing Malicious Outputs from CodeLLM (Borana et al., Findings 2025)
Copy Citation:
PDF:: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1041.pdf
Checklist:: 2025.findings-emnlp.1041.checklist.pdf

PDF Cite Search Checklist Fix data