Osint at SemEval-2026 Task 13: A Distribution-Aware Framework for Machine-Generated Code Detection and Multi-Source Authorship Attribution
Shifali Agrahari, Abhishek Anand, Shubham Kannaujiya, Sanasam Ranbir Singh, Sujit Kumar
Abstract
The rise of code-generating LLMs such as DeepSeek, Qwen, and Meta-LLaMA has improved developer productivity but also increased the risks of plagiarism, copyright misuse, and insecure machine-generated code. While AI-text detection is well studied, machine-generated source-code detection, especially across multiple languages, LLM families, and out-of-distribution (OOD) conditions, remains underexplored. SemEval-2026 Task 13 addresses this via two subtasks: (A) binary human–machine code detection and (B) multi-class authorship attribution across ten LLM families. For Subtask A, we fine-tune RoBERTa, CodeBERT, GraphCodeBERT, and StarCoderBase-1B, introducing a stratified sampling strategy with class-weighted loss to mitigate imbalance and OOD shifts. For Subtask B, we mitigate the extreme human-class imbalance using undersampling, inverse-frequency weights, syntactic noising, and curriculum-based dual-path training with TinyStarCoderPy and CodeBERT. Results on both subtasks show that long-context modeling, distribution-aware sampling, and lightweight noise-robust training are key factors for reliable machine-generated code detection and authorship attribution in real-world settings.
- Anthology ID:
- 2026.semeval-1.360
- Volume:
- Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, USA
- Editors:
- Ekaterina Kochmar, Debanjan Ghosh, Kai North, Mamoru Komachi
- Venues:
- SemEval | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2866–2876
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.semeval-1.360/
- Cite (ACL):
- Shifali Agrahari, Abhishek Anand, Shubham Kannaujiya, Sanasam Ranbir Singh, and Sujit Kumar. 2026. Osint at SemEval-2026 Task 13: A Distribution-Aware Framework for Machine-Generated Code Detection and Multi-Source Authorship Attribution. In Proceedings of the 20th International Workshop on Semantic Evaluation (2026), pages 2866–2876, San Diego, California, USA. Association for Computational Linguistics.
- Cite (Informal):
- Osint at SemEval-2026 Task 13: A Distribution-Aware Framework for Machine-Generated Code Detection and Multi-Source Authorship Attribution (Agrahari et al., SemEval 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.semeval-1.360.pdf
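
The abstract mentions inverse-frequency weights and a class-weighted loss for handling class imbalance. As a minimal illustration only, the sketch below computes weights with the common convention w_c = N / (K * n_c) and applies them in a weighted cross-entropy. The exact formula used in the paper is not stated here, so this convention, the function names, and the toy labels (`human`, `llm_a`) are all assumptions made for the example.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to N / (K * n_c): rarer classes get larger weights.

    N is the number of samples, K the number of classes, n_c the class count.
    This is one common convention, assumed here for illustration.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood.

    probs is a list of dicts mapping class -> predicted probability,
    aligned with labels. Each sample's loss is scaled by its class weight
    and the sum is normalised by the total weight.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


# Hypothetical toy data: a majority "human" class and a minority LLM class.
labels = ["human"] * 8 + ["llm_a"] * 2
weights = inverse_frequency_weights(labels)

# A classifier that always predicts 50/50 has loss log(2) under any weighting.
uniform = [{"human": 0.5, "llm_a": 0.5} for _ in labels]
loss = weighted_cross_entropy(uniform, labels, weights)
```

With these toy labels the minority class receives four times the weight of the majority class, so errors on rare classes contribute proportionally more to the loss.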