Osint at SemEval-2026 Task 13: A Distribution-Aware Framework for Machine-Generated Code Detection and Multi-Source Authorship Attribution

Shifali Agrahari, Abhishek Anand, Shubham Kannaujiya, Sanasam Ranbir Singh, Sujit Kumar


Abstract
The rise of code-generating LLMs such as DeepSeek, Qwen, and Meta-LLaMA has improved developer productivity but also increased the risks of plagiarism, copyright misuse, and insecure machine-generated code. While AI-text detection is well studied, machine-generated source-code detection, especially across multiple languages, LLM families, and out-of-distribution (OOD) conditions, remains underexplored. SemEval-2026 Task 13 addresses this via two subtasks: (A) binary human–machine code detection and (B) multi-class authorship attribution across ten LLM families. For Subtask A, we fine-tune RoBERTa, CodeBERT, GraphCodeBERT, and StarCoderBase-1B, introducing a stratified sampling strategy with class-weighted loss to mitigate class imbalance and OOD shifts. For Subtask B, we mitigate the extreme human-class imbalance using undersampling, inverse-frequency weights, syntactic noising, and curriculum-based dual-path training with TinyStarCoderPy and CodeBERT. Results on both subtasks show that long-context modeling, distribution-aligned sampling, and lightweight noise-robust training are key factors for reliable machine-generated code detection and authorship attribution in real-world settings.
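The abstract names undersampling and inverse-frequency weights as remedies for class imbalance but does not give formulas. The following is a minimal sketch of one common form of each, not the authors' implementation: the weight formula N / (K * count), the function names, the label strings, and the cap value are all assumptions made for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to N / (K * count): rarer classes get larger weights.

    N is the number of samples and K the number of distinct classes.
    This "balanced" heuristic is an assumed choice, not taken from the paper.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def undersample(labels, majority, cap):
    """Return indices to keep: at most `cap` majority-class samples, plus all others.

    Deterministic for clarity (keeps the first `cap` seen); a real pipeline
    would more likely pick the majority samples at random.
    """
    kept, seen = [], 0
    for i, y in enumerate(labels):
        if y == majority:
            if seen >= cap:
                continue
            seen += 1
        kept.append(i)
    return kept


# Hypothetical toy label set: 8 human-written samples, 2 from one LLM family.
labels = ["human"] * 8 + ["llm_a"] * 2
weights = inverse_frequency_weights(labels)
kept = undersample(labels, majority="human", cap=3)
```

In this toy case the minority class receives a weight four times that of the majority class, and undersampling keeps three human samples alongside both LLM samples. The resulting weights could be passed to a weighted cross-entropy loss during fine-tuning.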
Anthology ID:
2026.semeval-1.360
Volume:
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Ekaterina Kochmar, Debanjan Ghosh, Kai North, Mamoru Komachi
Venues:
SemEval | WS
Publisher:
Association for Computational Linguistics
Pages:
2866–2876
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.semeval-1.360/
Cite (ACL):
Shifali Agrahari, Abhishek Anand, Shubham Kannaujiya, Sanasam Ranbir Singh, and Sujit Kumar. 2026. Osint at SemEval-2026 Task 13: A Distribution-Aware Framework for Machine-Generated Code Detection and Multi-Source Authorship Attribution. In Proceedings of the 20th International Workshop on Semantic Evaluation (2026), pages 2866–2876, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Osint at SemEval-2026 Task 13: A Distribution-Aware Framework for Machine-Generated Code Detection and Multi-Source Authorship Attribution (Agrahari et al., SemEval 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.semeval-1.360.pdf