Bogdan Dura

2026

ICI Innolabs at SemEval-2026 Task 13: Sliding Windows Meet Code Transformers
Sebastian Balmus | Bogdan Dura
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We describe our system for SemEval-2026 Task 13, Subtask B, which focuses on multi-class authorship attribution for code: given a code snippet, the goal is to predict whether it is human-written or generated by one of ten LLM families. The task presents two central challenges: severe class imbalance and long input sequences that frequently exceed the context length of encoder-based Transformers. To address these issues, we adopt a window-based fine-tuning and inference framework. During training, we randomly sample 512-token windows from each snippet and optimize a class-weighted cross-entropy objective with label smoothing. At inference time, we apply a sliding-window strategy and aggregate window-level logits to obtain a snippet-level prediction. We fine-tune three pretrained code encoders (CodeBERT, UniXcoder, and StarEncoder) under this framework and combine their outputs via majority voting. On the official validation split, our best single model (StarEncoder) achieves 0.60 macro F1. On the final test set, the three-model ensemble reaches 0.41 macro F1, ranking 10th on the leaderboard. Our results demonstrate that window-based modeling combined with imbalance-aware optimization provides a robust and reproducible baseline for multi-class LLM attribution under distribution shift.

Co-authors

Sebastian Balmus 1

Venues

SemEval1
WS1

Fix author