Atakan Site
2026
Codexa at SemEval-2026 Task 13: Loss Engineering and Diverse Ensemble Strategies for Multi-Class Code Authorship Attribution
Anıl Dervişoğlu | Atakan Site
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
We describe our system for SemEval-2026 Task 13, Subtask B: classifying code into 11 categories (human-written or generated by one of 10 LLM families). The task presents extreme class imbalance and distribution shift across the generators in the dataset (31 in training, 59 in test, 36 of them unseen). Our approach has two components: (1) UniXcoder as the encoder with Label-Distribution-Aware Margin (LDAM) loss to handle class imbalance, which provides a +7% absolute improvement over the cross-entropy baseline; and (2) a diverse ensemble of 12 models trained with different objectives and architectures (detailed in the appendix), combined by hard voting. Our system achieves 41.28% Macro F1 on the official test set. We find that loss engineering and ensemble diversity matter more than domain adaptation techniques, which consistently degraded test performance.
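As a rough illustration of the two ingredients named in the abstract, the sketch below computes LDAM-style per-class margins (proportional to n_j^(-1/4), as in the standard LDAM formulation), applies them in a cross-entropy loss for a single example, and combines model predictions by majority vote. It is not the authors' implementation: the function names, the `max_margin` and `scale` values, and the tie-breaking rule are assumptions made for this example.

```python
import math
from collections import Counter


def ldam_margins(class_counts, max_margin=0.5):
    """Per-class margins proportional to n_j ** -0.25.

    Rare classes get larger margins. max_margin is an assumed
    hyperparameter: the rarest class receives exactly this margin.
    """
    raw = [1.0 / (n ** 0.25) for n in class_counts]
    factor = max_margin / max(raw)
    return [r * factor for r in raw]


def ldam_loss(logits, target, margins, scale=30.0):
    """Cross-entropy for one example after subtracting the margin
    from the true-class logit. scale is an assumed temperature."""
    adjusted = list(logits)
    adjusted[target] -= margins[target]
    scaled = [scale * z for z in adjusted]
    # Numerically stable log-sum-exp.
    top = max(scaled)
    log_sum = top + math.log(sum(math.exp(z - top) for z in scaled))
    return log_sum - scaled[target]


def hard_vote(predictions):
    """Majority vote over the labels predicted by each model.

    Ties are broken by the smallest label id (an assumption; the
    abstract does not say how ties are resolved).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)
```

Because the margin is subtracted only from the true-class logit, a correct prediction on a rare class must clear a larger gap before its loss becomes small, which is the mechanism LDAM relies on to counter class imbalance.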
ITUNLP at MWE-2026 AdMIRe 2: A Zero-Shot LLM Pipeline for Multimodal Idiom Understanding and Ranking
Atakan Site | Oğuz Ali Arslan | Gülşen Eryiğit
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
This paper presents our system for AdMIRe 2 (Advancing Multimodal Idiomaticity Representation), a shared task on multilingual multimodal idiom understanding. The task focuses on ranking images according to how well they depict the literal or idiomatic usage of potentially idiomatic expressions (PIEs) in context, across 15 languages and two tracks: a text-only track, and a multimodal track that uses both images and captions. To tackle both tracks, we propose a hybrid zero-shot pipeline built on large vision–language models (LVLMs). Our system employs a chain-of-thought prompting scheme that first classifies each PIE usage as literal or idiomatic and then ranks candidate images by their alignment with the inferred meaning. A primary–fallback routing mechanism increases robustness to safety-filter refusals, while lightweight post-processing recovers consistent rankings from imperfect model outputs. Without any task-specific fine-tuning, our approach achieves 55.9% Top-1 Accuracy in the text-only track and 60.1% in the multimodal (text+image) track, ranking first overall on the official leaderboard. These results suggest that carefully designed zero-shot LVLM pipelines can provide strong baselines for multilingual multimodal idiomaticity benchmarks.
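The routing and post-processing steps mentioned in the abstract could look roughly like the sketch below. It is an illustration under stated assumptions, not the system's actual code: `primary` and `fallback` are hypothetical callables standing in for model calls, a refusal is assumed to surface as `None`, and the repair rule (keep first valid mentions, then append missing candidates in their original order) is one plausible choice among several.

```python
import re


def repair_ranking(raw_output, candidates):
    """Turn a possibly messy model reply into a full, duplicate-free ranking.

    Keeps the first mention of each valid candidate id in the order the
    model produced them, ignores unknown tokens and repeats, and appends
    any candidates the model left out in their original order.
    """
    valid = set(candidates)
    seen = []
    for token in re.findall(r"[A-Za-z0-9_]+", raw_output):
        if token in valid and token not in seen:
            seen.append(token)
    missing = [c for c in candidates if c not in seen]
    return seen + missing


def rank_with_fallback(prompt, candidates, primary, fallback):
    """Query the primary model; if it refuses (assumed to return None),
    route the same prompt to the fallback model, then repair the reply."""
    reply = primary(prompt)
    if reply is None:
        reply = fallback(prompt)
    return repair_ranking(reply or "", candidates)
```

A caveat of this simple tokenizer is that a candidate id such as `A` would also match the English word "A" in free-form prose, so in practice the model would need to be prompted for a constrained output format or the ids chosen to be unambiguous.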
2025
ITUNLP at SemEval-2025 Task 8: Question-Answering over Tabular Data: A Zero-Shot Approach using LLM-Driven Code Generation
Atakan Site | Emre Erdemir | Gülşen Eryiğit
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
This paper presents our system for SemEval-2025 Task 8: DataBench, Question-Answering over Tabular Data. The primary objective of this task is to perform question answering on given tabular datasets from diverse domains, under two subtasks: DataBench QA (Subtask I) and DataBench Lite QA (Subtask II). To tackle both subtasks, we developed a zero-shot solution with a particular emphasis on leveraging Large Language Model (LLM)-based code generation. Specifically, we proposed a Python code generation framework, utilizing state-of-the-art open-source LLMs to generate executable Pandas code via optimized prompting strategies. Our experiments reveal that different LLMs exhibit varying levels of effectiveness in Python code generation. Additionally, results show that Python code generation achieves superior performance in tabular question answering compared to alternative approaches. Although our ranking among zero-shot systems is unknown at the time of this paper's submission, our system achieved eighth place in Subtask I and sixth place in Subtask II among the 30 systems that outperformed the baseline in the open-source models category.