\documentclass[11pt]{article}

% Change "review" to "final" to generate the final (sometimes called camera-ready) version.
% Change to "preprint" to generate a non-anonymous version with page numbers.
\usepackage[final]{acl}

% Standard package includes
\usepackage{times}
\usepackage{latexsym}

% For proper rendering and hyphenation of words containing Latin characters (including in bib files)
\usepackage[T1]{fontenc}
% For Vietnamese characters
% \usepackage[T5]{fontenc}
% See https://www.latex-project.org/help/documentation/encguide.pdf for other character sets

% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}

% This is not strictly necessary, and may be commented out,
% but it will improve the layout of the manuscript,
% and will typically save some space.
\usepackage{microtype}

% This is also not strictly necessary, and may be commented out.
% However, it will improve the aesthetics of text in
% the typewriter font.
\usepackage{inconsolata}

% Allow long URLs to break across lines
\usepackage{xurl}

% For math equations
\usepackage{amsmath}

% For \checkmark and other math symbols
\usepackage{amssymb}

% For professional tables
\usepackage{booktabs}
\usepackage{multirow}

%Including images in your LaTeX document requires adding
%additional package(s)
\usepackage{graphicx}


% If the title and author information does not fit in the area allocated, uncomment the following
%
%\setlength\titlebox{<dim>}
%
% and set <dim> to something 5cm or larger.

\title{Team\_Omega at SemEval-2026 Task 13: Frozen vs. Trainable Representations for Out-of-Distribution AI-Generated Code Detection: A CodeBERT Fine-Tuning Study}

% Author information can be set in various styles:
% For several authors from the same institution:
% \author{Author 1 \and ... \and Author n \\
%         Address line \\ ... \\ Address line}
% if the names do not fit well on one line use
%         Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
% For authors from different institutions:
% \author{Author 1 \\ Address line \\  ... \\ Address line
%         \And  ... \And
%         Author n \\ Address line \\ ... \\ Address line}
% To start a separate ``row'' of authors use \AND, as in
% \author{Author 1 \\ Address line \\  ... \\ Address line
%         \AND
%         Author 2 \\ Address line \\ ... \\ Address line \And
%         Author 3 \\ Address line \\ ... \\ Address line}

\author{
  Nahid Niyaz Shovon\textsuperscript{1} \quad
  MD. Naim Parvez\textsuperscript{1} \\
  \textsuperscript{1}Rajshahi University of Engineering \& Technology \\
  \{nahidniyaz185, naimparvez999\}@gmail.com
}


\begin{document}
\maketitle
\begin{abstract}
This paper describes Team\_Omega's submission to SemEval-2026 Task 13 (Subtask A), which focuses on detecting AI-generated code under severe cross-language and cross-domain distribution shift. We investigate whether task-specific fine-tuning improves or harms out-of-distribution (OOD) generalization by conducting a controlled comparison between two CodeBERT-based configurations: a fully frozen encoder and a partially fine-tuned model with last-layer adaptation and a deeper residual classification head. While partial fine-tuning substantially improves in-domain performance, reaching a macro F1 score of 0.9841, it degrades severely under OOD evaluation, collapsing to biased predictions and achieving a macro F1 score of only 0.3026. In contrast, the frozen-backbone baseline achieves a lower in-domain score but more stable OOD performance, reaching a macro F1 score of 0.5132. Our findings highlight a critical trade-off between in-domain discrimination and cross-language robustness, suggesting that preserving pretrained multilingual representations may be preferable to aggressive task adaptation in distribution-shifted code detection settings.
\end{abstract}
\section{Introduction}

Large Language Models (LLMs) have revolutionized software engineering by automating code generation, but their proliferation raises concerns regarding academic integrity and software security. To mitigate these risks, automated systems must accurately distinguish machine-generated from human-written code.


SemEval-2026 Task 13, ``Detecting Machine-Generated Code with Multiple Programming Languages, Generators, and Application Scenarios''~\citep{orel-etal-2026-semeval-2026}, directly addresses this challenge. Subtask A is a binary classification problem: given a code snippet, determine whether it is fully human-written (label 0) or fully machine-generated (label 1), with macro F1 as the primary metric. One of its distinguishing features is its strict OOD evaluation protocol. Training data consists of algorithmic problems in seen languages (C++, Python, Java), while the hidden test set spans unseen languages (Go, PHP, C\#, C, JavaScript) and unseen domains (research and production code). This cross-language and cross-domain design discourages reliance on language-specific syntax or generator-specific artifacts, encouraging models to learn more generalizable stylistic and semantic patterns that distinguish human-written from machine-generated code.

We build our system on CodeBERT~\citep{feng-etal-2020-codebert} and conduct a systematic comparison of two configurations: (1) a frozen-backbone baseline that trains only a lightweight MLP classifier on [CLS] token representations, and (2) a partial-unfreeze model that unfreezes only CodeBERT's final encoder layer (layer 11, counting from 0, of 12 layers) and trains it with differential learning rates. The classification head of the second configuration incorporates residual connections~\citep{he-etal-2016-resnet}, GELU activations~\citep{hendrycks-gimpel-2016-gelu}, and layer normalization~\citep{ba-etal-2016-layernorm}, and is trained with a class-weighted loss to address the mild training data imbalance (48\%/52\% human/AI split). Our team ranked 44th out of 81 participating teams in Subtask A.
Our code is publicly available.\footnote{\url{https://github.com/TesNikk/SEMEVAL-2026-Task-13}}

Our main contributions are: (1) a controlled comparison showing that a fully frozen CodeBERT backbone can outperform partial fine-tuning under severe OOD conditions; (2) an analysis of architectural enhancements (residual connections, GELU, LayerNorm, class-weighted loss) indicating that increased parameterization improves in-domain performance but not OOD generalization; and (3) an OOD error analysis exposing prediction collapse and highlighting the need for domain-invariant training strategies.
\section{Related Work}

\subsection{AI-Generated Code Detection}

Early work on AI-generated content (AIGC) detection focused primarily on natural language, encompassing zero-shot statistical methods such as DetectGPT~\citep{mitchell-etal-2023-detectgpt} and supervised classifiers built on pretrained encoders~\citep{openai2023gptdetector}. However, these methods transfer poorly to source code due to its strict syntactic constraints, non-natural-language identifiers, and language-specific structural patterns~\citep{chen-etal-2021-codex}.

Shared-task benchmarks further highlight this challenge. SemEval-2024 Task 1~\citep{kubler-etal-2024-semeval2024task1} demonstrated strong in-domain performance for text detectors but substantial degradation under distribution shift. Large-scale benchmarks such as CoDet-M4~\citep{orel-etal-2025-codet} and DroidCollection/DroidDetect~\citep{orel-etal-2025-droid} introduced multilingual and multi-generator evaluation settings, revealing persistent cross-language and cross-domain robustness gaps. \citet{ahmad-etal-2024-codeood} similarly showed that OOD generalization remains a critical bottleneck in code authorship attribution, which motivates our comparison of frozen and partially fine-tuned encoders.

\subsection{Pre-trained Code Models}

Encoder-only models marked a breakthrough in code understanding. CodeBERT~\citep{feng-etal-2020-codebert} demonstrated strong cross-language transfer through joint pre-training on code--natural language pairs using masked language modeling (MLM) and replaced token detection (RTD) objectives. UniXCoder~\citep{guo-etal-2022-unixcoder} unified cross-modal understanding across 12 programming languages, while CodeT5~\citep{wang-etal-2021-codet5} introduced encoder-decoder architectures excelling in code completion and summarization. These representations capture stylistic and semantic code properties essential for authorship attribution~\citep{ren-etal-2023-crosslangcode}, though recent benchmarks show domain-specific overfitting~\citep{orel-etal-2025-codet,orel-etal-2025-droid}. Decoder-only models such as CodeGen~\citep{li-etal-2023-codegen} and CodeGeeX~\citep{zheng-etal-2023-codegeex} expanded the landscape of code-generating LLMs, becoming both generators of AI code and potential feature extractors for detection systems.


\begin{table}[ht]
\centering

\resizebox{\columnwidth}{!}{%
\begin{tabular}{llll}
\toprule
\textbf{Model} & \textbf{Venue} & \textbf{Key Innovation} & \textbf{Multilingual} \\
\midrule
CodeBERT  & EMNLP'20 & MLM + RTD bimodal & \checkmark\ (6 langs) \\
UniXCoder & ACL'22   & Cross-modal unification & \checkmark\ (12 langs) \\
CodeT5    & EMNLP'21 & Encoder-decoder & \checkmark\ (8 langs) \\
CodeGen   & ICLR'23  & Open decoder-only & \checkmark\ (multilingual) \\
\bottomrule
\end{tabular}}
\caption{Comparison of pre-trained code models.}
\label{tab:code-models}
\end{table}


Our work builds upon CodeBERT as the backbone encoder, motivated by its strong multilingual code representations and compact architecture suitable for compute-constrained fine-tuning experiments, extending partial adaptation techniques validated across recent multilingual benchmarks~\citep{orel-etal-2025-codet}.

\section{Dataset and Task Description}

\subsection{Task Definition}

SemEval-2026 Task 13~\citep{orel-etal-2026-semeval-2026}, Subtask A, poses a binary classification problem: given a source code snippet, a system must predict whether it is fully human-written (label~0) or fully machine-generated (label~1). The primary evaluation metric is the macro-averaged F1 score, which weights both classes equally. The task imposes three constraints: (i)~no external datasets may be used; (ii)~no models pre-trained specifically for AI-generated code detection are allowed; and (iii)~the \texttt{generator} metadata column, which identifies the specific AI model that produced each sample, must not be used during either training or inference. General-purpose or code-oriented pre-trained models such as CodeBERT, UniXCoder, and CodeT5 are permitted.

\subsection{Out-of-Distribution Evaluation Protocol}

The task's rigorous OOD evaluation protocol tests generalization across two independent axes: \emph{programming language} and \emph{application domain}. Training data covers seen languages (C++, Python, Java) in a seen domain (algorithmic problems), while the test set extends to unseen languages (Go, PHP, C\#, C, JavaScript) and unseen domains (research, production code). The four evaluation conditions are summarized in Table~\ref{tab:ood-eval}.

\begin{table}[t]
\centering

\resizebox{\columnwidth}{!}{%
\begin{tabular}{lll}
\toprule
\textbf{Setting} & \textbf{Languages} & \textbf{Domains} \\
\midrule
Seen--Seen     & C++, Python, Java    & Algorithmic \\
Unseen--Seen   & Go, PHP, C\#, C, JS  & Algorithmic \\
Seen--Unseen   & C++, Python, Java    & Research, Production \\
Unseen--Unseen & Go, PHP, C\#, C, JS  & Research, Production \\
\bottomrule
\end{tabular}}
\caption{The four OOD evaluation conditions defined by the task.}
\label{tab:ood-eval}
\end{table}

\subsection{Dataset}

The dataset is divided into three splits. The \textbf{training set} contains 500,000 samples: 238,475 human-written and 261,525 machine-generated, resulting in a mild class imbalance with an approximate ratio of 1:1.1 (48\%/52\% human/AI). The \textbf{validation set} comprises approximately 100,000 samples drawn from the same domain distribution as the training data and is used exclusively for model selection.
The \textbf{test set} is a large-scale unlabeled dataset containing 500{,}000 samples.
Additionally, a separate sanity-check file contains 1{,}000 labeled examples drawn from the public test set. This split is provided solely to verify evaluation formatting and is strictly excluded from model selection and hyperparameter tuning.

Each sample in the dataset contains the fields described in Table~\ref{tab:dataset-fields}.

\begin{table}[t]
\centering
\small
\begin{tabular}{l p{3.8cm}}
\toprule
\textbf{Field} & \textbf{Description} \\
\midrule
\texttt{code}      & Source code snippet (model input) \\
\texttt{label}     & 0 = human, 1 = AI (train/val only) \\
\texttt{language}  & Programming language of the snippet \\
\texttt{generator} & AI model name or ``human'' (\emph{forbidden}) \\
\texttt{ID}        & Unique identifier (test set only) \\
\bottomrule
\end{tabular}
\caption{Dataset fields and their descriptions.}
\label{tab:dataset-fields}
\end{table}

\section{Methodology}


We formulate Subtask~A as binary classification: given a code snippet $x$, predict $y \in \{0, 1\}$ (human vs.\ AI-generated), evaluated via macro-averaged F1.


\subsection{Model Architecture}

CodeBERT (\texttt{microsoft/codebert-base})~\citep{feng-etal-2020-codebert} is used as the backbone encoder. CodeBERT follows the RoBERTa-base architecture, consisting of 12 transformer layers with a hidden dimensionality of 768 and approximately 125 million parameters. It was pre-trained on a bimodal corpus of code and natural language using masked language modeling and replaced token detection objectives, making it suitable as a general-purpose code representation model. Input code snippets are tokenized using the CodeBERT tokenizer with a maximum sequence length of 512 tokens; sequences longer than this length are truncated, and shorter sequences are padded to the maximum length.

We investigate two experimental configurations that differ in the degree of encoder adaptation.

\paragraph{Configuration A: Frozen Backbone Baseline.}
Here, all parameters of the CodeBERT encoder are frozen, and only a lightweight classification head is trained. The classification head operates on the \texttt{[CLS]} token representation and consists of three linear layers with decreasing dimensionality ($768 \rightarrow 256 \rightarrow 64 \rightarrow 2$), interleaved with ReLU activations and dropout regularization (rate 0.3). All linear layers are initialized with Xavier uniform initialization~\citep{glorot-bengio-2010-xavier} and zero biases. This configuration yields approximately 214K trainable parameters and serves as a frozen reference model.

\paragraph{Configuration B: Partial Unfreezing with Residual Head.}
Here, only the final transformer layer (layer 11) of CodeBERT is unfrozen, while layers 0--10, embeddings, and the pooler remain frozen. This enables limited task adaptation while preserving lower-level pretrained representations. A differential learning rate scheme is applied, assigning a $10\times$ lower learning rate to the unfrozen encoder layer relative to the classification head to mitigate catastrophic forgetting~\citep{kirkpatrick2017overcoming}.

The classification head has a residual pre-classifier followed by a deeper projection network. The pre-classifier applies a linear transformation ($768 \rightarrow 768$), LayerNorm, GELU, and dropout (0.3), with a residual connection adding the original \texttt{[CLS]} embedding. The resulting representation is passed through successive linear layers ($768 \rightarrow 512 \rightarrow 256 \rightarrow 64 \rightarrow 2$), each followed by LayerNorm, GELU, and dropout (0.3), except for the penultimate layer where dropout is reduced to 0.15. This configuration introduces approximately 8.2M trainable parameters ($\sim$7\% of total).
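
In symbols, writing $h \in \mathbb{R}^{768}$ for the \texttt{[CLS]} embedding, the residual pre-classifier described above can be sketched as follows, where $W$, $b$, and $\tilde{h}$ are notation introduced here for illustration:
\begin{equation*}
\tilde{h} = h + \mathrm{Dropout}\big(\mathrm{GELU}(\mathrm{LN}(W h + b))\big).
\end{equation*}
As a rough consistency check on the trainable-parameter figure, the count can be tallied under assumptions that are ours rather than stated above: the unfrozen layer has the standard RoBERTa-base shape (hidden size $d = 768$, feed-forward size $3072$, bias terms on every linear map, and two LayerNorms with $2d$ parameters each), and the head applies LayerNorm after every linear layer except the final $64 \rightarrow 2$ projection.
\begin{align*}
P_{\text{attn}} &= 4(d^2 + d) = 2{,}362{,}368, \\
P_{\text{ffn}} &= 2 \cdot 3072\,d + 3072 + d = 4{,}722{,}432, \\
P_{\text{LN}} &= 2 \cdot 2d = 3{,}072, \\
P_{\text{layer}} &= 7{,}087{,}872, \\
P_{\text{head}} &= 592{,}128 + 394{,}752 + 131{,}840 \\
&\quad + 16{,}576 + 130 = 1{,}135{,}426, \\
P_{\text{total}} &= P_{\text{layer}} + P_{\text{head}} = 8{,}223{,}298.
\end{align*}
Under these assumptions the tally comes to about 8.2M trainable parameters, or roughly 6.5\% of a model with about 126M parameters in total, which is consistent with the rounded figures reported above.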

\subsection{Optimization Strategy}

Both configurations use AdamW~\citep{loshchilov-hutter-2019-adamw} (weight decay 0.01) with OneCycleLR~\citep{smith-2018-onecycle} (cosine annealing, 10\% linear warmup), batch size 32, and gradient clipping (max norm 1.0).

Configuration A (frozen backbone) uses a single learning rate of $2 \times 10^{-4}$ and standard cross-entropy loss.

Configuration B uses differential learning rates: $2 \times 10^{-5}$ for the unfrozen encoder layer and $2 \times 10^{-4}$ for the classification head. Its cross-entropy loss is weighted per class, with the weight for class $c$ given by
\begin{equation}
  w_c = \frac{N}{C \cdot n_c},
\end{equation}
where $N$ is the total number of samples, $C=2$, and $n_c$ is the count for class $c$.
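
For illustration, substituting the training-set counts reported in the dataset description ($N = 500{,}000$, with $n_0 = 238{,}475$ human-written and $n_1 = 261{,}525$ machine-generated samples) gives the following weights, rounded to four decimals:
\begin{align*}
w_0 &= \frac{500{,}000}{2 \cdot 238{,}475} \approx 1.0483, \\
w_1 &= \frac{500{,}000}{2 \cdot 261{,}525} \approx 0.9559.
\end{align*}
The two weights differ by less than 10\%, so under this formula the reweighting is mild, in line with the near-balanced 48\%/52\% split.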

Both models are trained for a single epoch of 500K samples (15{,}625 steps), reflecting a compute-constrained design that balances convergence and overfitting risk. No extensive hyperparameter search or ablation was conducted; the reported configurations represent manually selected settings. Validation metrics reported throughout correspond to end-of-epoch evaluation on the in-domain validation split.
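
The step count follows directly from the training-set size and the batch size. The warmup length below additionally assumes that the 10\% warmup fraction is taken over the total number of steps:
\begin{align*}
\text{steps} &= \frac{500{,}000}{32} = 15{,}625, \\
\text{warmup} &\approx 0.10 \times 15{,}625 \approx 1{,}563.
\end{align*}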


\section{Experimental Setup}


The primary metric is macro-averaged F1: $F1_{\text{macro}} = \frac{1}{2}(F1_{\text{human}} + F1_{\text{AI}})$, assigning equal weight to both classes regardless of prevalence.
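
For reference, each per-class score is the harmonic mean of that class's precision and recall. The symbols $P_c$ and $R_c$ are introduced here for convenience and follow the standard definitions:
\begin{equation*}
F1_c = \frac{2\, P_c\, R_c}{P_c + R_c}, \qquad c \in \{\text{human}, \text{AI}\}.
\end{equation*}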

\subsection{Training Protocol}

Both models are trained in strict compliance with the task constraints: no external datasets are used, and the \texttt{generator} metadata column is excluded from both training and inference. The provided test sample (1{,}000 labeled examples) is reserved exclusively as a post-hoc formatting and sanity check, and is not used for model selection or hyperparameter tuning.

\subsection{Implementation Details}

Experiments are conducted on the Kaggle platform using a single NVIDIA Tesla P100 GPU. A fixed random seed (42) is set across all relevant libraries to support reproducibility. The maximum input sequence length is 512 tokens, and the batch size is 32.

\subsection{Experimental Comparison Setup}

The two configurations form a controlled experimental setup to isolate the impact of encoder adaptation. Table~\ref{tab:config-comparison} summarizes the key differences between the two configurations.

\begin{table}[t]
\centering
\resizebox{\columnwidth}{!}{%
\begin{tabular}{lll}
\toprule
\textbf{Aspect} & \textbf{Config.\ A} & \textbf{Config.\ B} \\
\midrule
Encoder          & Fully frozen        & Last layer trainable \\
Trainable params & $\sim$214K          & $\sim$8.2M \\
Head depth       & 3 Linear            & 4 Linear + pre-clf \\
Activation       & ReLU                & GELU \\
Normalization    & None                & LayerNorm \\
Residual conn.   & No                  & Yes \\
Class weighting  & No                  & Yes (balanced) \\
Differential LR  & No                  & Yes \\
Loss             & CrossEntropy        & Weighted CE \\
\bottomrule
\end{tabular}}
\caption{Comparison of the two experimental configurations.}
\label{tab:config-comparison}
\end{table}

\section{Results}

\subsection{In-Domain Performance}

Table~\ref{tab:main-results} summarizes the training and validation results for both configurations. 

\begin{table}[t]
\centering
\resizebox{\columnwidth}{!}{%
\begin{tabular}{lcccc}
\toprule
\textbf{Configuration} & \textbf{Train Loss} & \textbf{Train F1} & \textbf{Val Loss} & \textbf{Val F1} \\
\midrule
A (Frozen)           & 0.2800 & 0.8862 & 0.1853 & 0.9216 \\
B (Partial Unfreeze) & 0.1007 & 0.9640 & 0.0569 & 0.9841 \\
\bottomrule
\end{tabular}}
\caption{In-domain training and validation results. Both models are trained for one epoch on the same data.}
\label{tab:main-results}
\end{table}

Configuration B outperforms Configuration A across all in-domain metrics, with validation macro F1 improving from 0.9216 to 0.9841 and validation loss falling from 0.1853 to 0.0569. This indicates that partial encoder adaptation, combined with the deeper classification head, yields more discriminative in-domain representations than the frozen baseline.

\subsection{OOD Generalization}

Table~\ref{tab:ood-results} reports the macro F1 scores from our official submissions to the SemEval-2026 Task 13 leaderboard, evaluated on the hidden test set spanning unseen languages and domains. The generalization gap between validation and leaderboard performance is the primary quantity of interest.

\begin{table}[t]
\centering
\resizebox{\columnwidth}{!}{%
\begin{tabular}{lcc}
\toprule
\textbf{Configuration} & \textbf{Val F1} & \textbf{Test F1} \\
\midrule
A (Frozen)           & 0.9216 & 0.5132 \\
B (Partial Unfreeze) & 0.9841 & 0.3026 \\
\bottomrule
\end{tabular}}
\caption{Official leaderboard results (OOD evaluation on the test set).}
\label{tab:ood-results}
\end{table}

Both configurations suffer substantial OOD drops, but the severity differs markedly. Configuration A drops from 0.9216 to 0.5132, whereas Configuration B drops from 0.9841 to 0.3026. Counterintuitively, the stronger in-domain model generalizes worse. This suggests that Configuration B's larger parameter budget ($\sim$8.2M vs.\ $\sim$214K) overfits to language-specific syntax and algorithmic patterns that do not transfer to unseen languages and domains. However, we note that this effect may also be partially influenced by limited training duration: with only a single epoch over 500K samples, the partially fine-tuned model may not have reached stable representations. Additional training (e.g., 2--3 epochs) could potentially alter this behavior, although this was not explored due to compute constraints.
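
Stated as validation-to-test gaps, using the values in Table~\ref{tab:ood-results} and writing $\Delta$ for the drop in macro F1:
\begin{align*}
\Delta_A &= 0.9216 - 0.5132 = 0.4084, \\
\Delta_B &= 0.9841 - 0.3026 = 0.6815.
\end{align*}
Configuration~B thus loses roughly 0.27 more macro F1 than Configuration~A when moving from the in-domain validation split to the hidden test set.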


\subsection{Error Analysis}

We evaluate both configurations on the 1{,}000-example labeled public test sample (777 human, 223 AI) to analyze model behavior. This split is separate from the hidden leaderboard test set; therefore, the reported per-class scores are diagnostic and do not reflect official leaderboard results. Table~\ref{tab:error-analysis} presents precision, recall, and F1 per class.

\begin{table}[t]
\centering

\resizebox{\columnwidth}{!}{%
\begin{tabular}{llccc}
\toprule
\textbf{Config.} & \textbf{Class} & \textbf{Precision} & \textbf{Recall} & \textbf{F1} \\
\midrule
\multirow{2}{*}{A (Frozen)} & Human & 0.8565 & 0.4916 & 0.6247 \\
 & AI & 0.2870 & 0.7130 & 0.4093 \\
\midrule
\multirow{2}{*}{B (Partial Unfreeze)} & Human & 0.9615 & 0.1287 & 0.2270 \\
 & AI & 0.2444 & 0.9821 & 0.3914 \\
\bottomrule
\end{tabular}}
\caption{Per-class performance on the 1{,}000-example labeled test sample.}
\label{tab:error-analysis}
\end{table}

The results reveal distinct failure modes across configurations (Table~\ref{tab:error-analysis}). Configuration~A is more balanced but mediocre on both classes, whereas Configuration~B exhibits \emph{prediction collapse}: it classifies the vast majority of samples as ``AI-generated'' (AI recall~$= 0.9821$, human recall~$= 0.1287$). This near-constant prediction of the AI class, which is the slight majority class in training, suggests that partial fine-tuning has overridden generalizable representations with distribution-specific decision boundaries. Notably, the diagnostic sample contains 777~human vs.\ 223~AI examples ($\approx$3.5:1 imbalance). Because AI examples are the minority in this sample, the imbalance depresses AI-class precision and may amplify the apparent severity of Configuration~B's bias toward the AI class.
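
Two quick checks can be derived from Table~\ref{tab:error-analysis}. First, averaging the per-class F1 values gives a diagnostic macro F1 for each configuration:
\begin{align*}
F1_{\text{macro}}^{A} &= \tfrac{1}{2}(0.6247 + 0.4093) = 0.5170, \\
F1_{\text{macro}}^{B} &= \tfrac{1}{2}(0.2270 + 0.3914) = 0.3092.
\end{align*}
Both values are close to the corresponding leaderboard scores (0.5132 and 0.3026), although the two evaluations use different data. Second, multiplying the reported recalls by the class sizes and rounding to whole examples suggests that Configuration~B labels about $0.9821 \times 223 \approx 219$ AI examples and $(1 - 0.1287) \times 777 \approx 677$ human examples as AI. These counts are inferred from the rounded table values rather than taken from logged predictions:
\begin{equation*}
\frac{219 + 677}{1{,}000} \approx 0.90, \qquad \frac{219}{219 + 677} \approx 0.2444.
\end{equation*}
That is, roughly 90\% of the sample is predicted as AI-generated, and the implied AI precision agrees with the value in the table.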

\subsection{Discussion}

The two configurations highlight a clear trade-off between in-domain accuracy and OOD robustness.

\paragraph{Frozen representations are more domain-agnostic.} Configuration A's frozen encoder was never adapted to the training distribution, which paradoxically makes it more robust, though still inadequate, on OOD data. The pre-trained features capture general code properties (lexical patterns, structural regularities) that transfer modestly across languages.

\paragraph{Fine-tuning amplifies distribution-specific features.} Configuration B achieves substantially higher in-domain performance but overfits to patterns specific to the seen languages (C++, Python, Java) and the algorithmic domain. Under language and domain shift, these learned features fail to generalize, leading to severe performance degradation and biased predictions. Additionally, Configuration~B uses class-weighted loss while Configuration~A does not, introducing a confound that may independently affect the observed prediction bias under distribution shift.

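\paragraph{Inference efficiency.} Configuration~A has far fewer trainable parameters ($\sim$214K vs.\ $\sim$8.2M), which primarily reduces training cost because no gradients flow through the encoder. At inference, both configurations run the full 12-layer encoder, so Configuration~A's advantage is limited to its smaller classification head.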

\paragraph{Implications for future work.} These results suggest that standard fine-tuning is insufficient for robust cross-language generalization. Explicit domain-invariant strategies may be necessary to achieve stable OOD performance. Examples include contrastive learning, which pulls same-class representations together while pushing apart different-class embeddings, and adversarial domain adaptation, which trains an auxiliary discriminator to make learned features domain-agnostic.

\section*{Limitations}

The 512-token input limit truncates longer programs, potentially removing structural information. Training was limited to a single epoch due to the Kaggle platform's 12-hour GPU runtime constraint; with 500K training samples and $\sim$8.2M trainable parameters, Configuration~B may be undertrained rather than solely overfitting, and additional epochs (e.g., 2--3) could potentially alter the observed generalization dynamics. Neither configuration explicitly models domain invariance, relying instead on pre-trained representations for transfer. No extensive ablation studies over architectural choices or loss configurations were performed. Finally, inference uses simple argmax prediction without ensembling or calibration, leaving room for improvement.
\newpage

% Bibliography entries for the entire Anthology, followed by custom entries
%\bibliography{anthology,custom}
% Custom bibliography entries only
\bibliography{custom}

% \appendix

% \section{Example Appendix}
% \label{sec:appendix}

% This is an appendix.

\end{document}