Raja Khurram Shahzad

2026

MIUN BiasPatrol at SemEval-2026 Task 13: Why TF-IDF can Beat Transformers for OOD Code Detection
Loviza Sahlen | Thomas Springfeldt | Mehwish Fatima | Raja Khurram Shahzad
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

The increasing use of AI-generated code underscores the need for effective detection systems. However, their performance often deteriorates when faced with distribution shifts. This paper presents our system for SemEval-2026 Task 13, Subtask A, which focuses on binary classification of human-written versus machine-generated code across various programming languages and domains. We systematically compare traditional classifiers, such as Random Forest and XGBoost, which use statistical and TF-IDF features, against pipelines that leverage frozen embeddings from code transformers such as UniXcoder and GraphCodeBERT. Our results reveal a notable trade-off: while transformer-based pipelines excel in in-distribution validation (reaching up to 0.89 Macro F1), they experience severe performance drops of up to 77% when applied to out-of-distribution languages and domains. In contrast, models employing TF-IDF with tree-based classifiers demonstrate significantly greater stability. We attribute this fragility to a bias toward superficial formatting, particularly whitespace, which is accentuated by transformers. By implementing simple space normalization, we enhance the out-of-distribution robustness of traditional models; however, this also highlights the ongoing dependence of embeddings on these non-semantic features. Our findings suggest that for creating generalizable code detection systems, straightforward, well-normalized lexical features may be more reliable than complex, unrefined embeddings.
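The whitespace bias described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the normalization rule, the tokenizer, the smoothed IDF formula, and the names `normalize_spaces`, `tokenize`, and `tfidf` are all assumptions chosen for illustration. The abstract says only that "simple space normalization" was applied before lexical features were extracted.

```python
import math
import re
from collections import Counter


def normalize_spaces(code: str) -> str:
    """Hypothetical normalization: collapse runs of spaces/tabs to one space,
    trim each line, and drop blank lines. This discards indentation, so the
    result is meant for feature extraction only, not for execution."""
    lines = []
    for line in code.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)


def tokenize(code: str) -> list:
    """Assumed lexical tokenizer: identifiers, integers, single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def tfidf(docs: list) -> list:
    """Compute one sparse TF-IDF vector (dict) per document after
    normalization, using a smoothed IDF: log((1 + n) / (1 + df)) + 1."""
    tokenized = [tokenize(normalize_spaces(d)) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        vectors.append({
            tok: (c / total) * (math.log((1 + n) / (1 + df[tok])) + 1)
            for tok, c in counts.items()
        })
    return vectors
```

Under these assumptions, two snippets that differ only in spacing map to the same feature vector, so a downstream classifier such as a tree ensemble cannot key on formatting alone.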

2025

This paper presents AfroEmo, a multilingual, multi-label emotion classification system designed for SemEval-2025 Task 11, leveraging the Afro-XLMR model. Our approach integrates adaptive pretraining on domain-specific corpora followed by fine-tuning on low-resource languages. Through comprehensive exploratory data analysis, we assess label distribution and model performance across diverse linguistic settings. By incorporating perceived emotions (how emotions are interpreted rather than explicitly stated), we enhance emotion recognition capabilities in underrepresented languages. Experimental results demonstrate that our method achieves competitive performance, particularly in Amharic, while addressing key challenges in low-resource emotion detection.

Co-authors

Laiba Rana 1

Loviza Sahlen 1

Thomas Springfeldt 1

Venues

SemEval 2
WS 2
