CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings

Daniil Orel, Dilshod Azizov, Preslav Nakov


Abstract
Large Language Models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency. However, this has had important consequences for programming skills, ethics, and assessment integrity, making the detection of LLM-generated code essential for maintaining accountability and standards. While there has been some previous research on this problem, it generally lacks domain coverage and robustness, and covers only a small number of programming languages. Here, we aim to bridge this gap. In particular, we propose a framework capable of distinguishing between human-written and LLM-generated program code across multiple programming languages, code generators, and domains. We use a large-scale dataset from renowned platforms and LLM-based code generators, apply rigorous data quality checks and feature engineering, and perform a comparative analysis of traditional machine learning models, pre-trained language models (PLMs), and LLMs for code detection. We further evaluate out-of-domain scenarios, such as detecting authorship and hybrid authorship of generated code, and generalization to unseen models, domains, and programming languages. Our extensive experiments show that our framework effectively distinguishes human-written from LLM-generated program code, setting a new benchmark for the task.
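At its core, the detection task described in the abstract is binary sequence classification over source code. The following is a minimal illustrative sketch of the general PLM-based approach, not the paper's pipeline: it loads a code-pretrained encoder with a two-way classification head and scores a snippet as human-written vs. LLM-generated. The checkpoint choice and label mapping are assumptions for illustration.

# Sketch: classify a code snippet as human-written vs. LLM-generated with a
# code-pretrained encoder. The head would first be fine-tuned on labeled
# human/LLM code pairs; checkpoint and label mapping here are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "microsoft/codebert-base"  # assumed code-pretrained encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

snippet = "def add(a, b):\n    return a + b\n"
inputs = tokenizer(snippet, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()

labels = {0: "human-written", 1: "LLM-generated"}  # hypothetical mapping
print({labels[i]: round(p.item(), 3) for i, p in enumerate(probs)})

In practice, the same classification setup would be trained and evaluated across multiple programming languages, generators, and domains to test the out-of-distribution generalization the paper studies.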
Anthology ID:
2025.findings-acl.550
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10570–10593
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.550/
Cite (ACL):
Daniil Orel, Dilshod Azizov, and Preslav Nakov. 2025. CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10570–10593, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings (Orel et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.550.pdf