Jerin Romijah Tuli


2026

This paper presents LACR-ENS, a calibration-aware ensemble system for detecting AI-generated code across eight programming languages (SemEval-2026 Task 13). We identify a severe asymmetric out-of-distribution (OOD) failure in fine-tuned code transformers: Expected Calibration Error (ECE) doubles from 0.09 on seen languages to 0.18 on unseen languages, and high-confidence predictions (p ≥ 0.80) are wrong 39% of the time on OOD inputs. We propose Language-Aware Confidence Routing (LACR), formally equivalent to implicit per-language temperature scaling, which reduces OOD ECE to 0.11 and improves macro-F1 by +0.013 over fixed-threshold ensembling. A language-family proximity analysis reveals that syntactic distance to training languages predicts OOD F1 with Pearson r=+0.94, providing a principled, label-free signal for deployment risk assessment and motivating a continuous routing extension. Our system combines UniXCoder and GraphCodeBERT via weighted logit-level fusion and achieves macro-F1 0.538, outperforming comparable encoder-only systems. We additionally document a HuggingFace label inversion pitfall that suppressed our initial score by approximately 0.29 F1.
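To make the calibration terms above concrete, the following is a minimal sketch of weighted logit-level fusion, temperature scaling, and ECE with equal-width bins. It is not the LACR-ENS implementation: the abstract does not give the routing rule, fusion weights, or temperatures, so the function names, the default `weight_a=0.5`, and the per-language values in `TEMPERATURES` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical per-language temperatures (illustrative values, not from the paper).
# A temperature above 1 softens confidence, e.g. for languages unseen in training.
TEMPERATURES = {"python": 1.0, "unseen": 2.0}


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(logits_a, logits_b, weight_a=0.5):
    """Weighted logit-level fusion of two models' logits for one sample."""
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(logits_a, logits_b)]


def scaled_probs(logits, temperature):
    """Temperature scaling: divide the logits by T before the softmax."""
    return softmax([x / temperature for x in logits])


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Usage: fuse two models' logits, then calibrate with a language-specific T.
fused = fuse_logits([3.0, 0.0], [1.0, 0.0], weight_a=0.5)
seen_conf = max(scaled_probs(fused, TEMPERATURES["python"]))
unseen_conf = max(scaled_probs(fused, TEMPERATURES["unseen"]))
```

With these illustrative values, the same fused logits yield a lower top-class confidence under the higher temperature, which is the mechanism by which per-language scaling can shrink the gap between confidence and accuracy on unseen languages.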
Our system, DualRAG (team MindFlayer), tackles SemEval-2026 Task 8 Subtask B: generating faithful responses in multi-turn RAG conversations. The core idea is simple: before generating anything, we first check whether reference passages exist for the current question. If they do, we route through a domain-guided generation prompt that instructs the model to answer using only those passages. If they do not, we route through a strict refusal prompt that tells the model to politely decline rather than guess. We used Meta’s Llama-4-Scout-17B through the Groq API with no training or fine-tuning, relying purely on zero-shot prompting. A lightweight post-processing layer catches the rare cases where the model ignores its instructions: if it refuses when passages are available, we replace the response with a neutral fallback; if it answers when no passages exist, we replace it with a standard refusal. Out of 507 test tasks, only 7 needed this correction. The system ranked 8th out of 26 teams with a harmonic mean of 0.7492, beating the strongest baseline (GPT-OSS-120B at 0.639) by roughly 0.11. The standout result is 100% refusal accuracy on all 130 unanswerable questions, something even GPT-4o and Llama 3.1 405B failed to achieve consistently according to prior work. Our RLF score of 0.8782 shows that the responses stay tightly grounded in the reference passages. The relatively lower RBagg (0.6024) reflects the challenge of matching human-written phrasing in a zero-shot setting, which we identify as the clearest direction for improvement.
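The routing and post-processing logic described above can be sketched as follows. This is an illustration under stated assumptions, not the DualRAG code: the prompt templates, the fallback and refusal strings, and the keyword-based `looks_like_refusal` check are all hypothetical, and `generate` is a stand-in callable for whatever LLM client is used (no real Groq API calls are shown).

```python
from typing import Callable, List

# Hypothetical prompt templates and canned responses (not taken from the paper).
GROUNDED_PROMPT = (
    "Answer using only the passages below.\n\n"
    "Passages:\n{passages}\n\nQuestion: {question}"
)
REFUSAL_PROMPT = (
    "No reference passages are available. Politely decline to answer.\n\n"
    "Question: {question}"
)
STANDARD_REFUSAL = "I'm sorry, but I don't have enough information to answer that."
NEUTRAL_FALLBACK = "Please refer to the provided passages for details on this question."

# Assumed heuristic: a reply counts as a refusal if it contains one of these markers.
REFUSAL_MARKERS = ("i'm sorry", "i cannot answer", "i don't have enough information")


def looks_like_refusal(text: str) -> bool:
    """Keyword check standing in for whatever refusal detector the system uses."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def respond(question: str, passages: List[str],
            generate: Callable[[str], str]) -> str:
    """Route on passage availability, then correct replies that ignore the route."""
    if passages:
        prompt = GROUNDED_PROMPT.format(
            passages="\n".join(passages), question=question
        )
        reply = generate(prompt)
        # The model refused although passages exist: use the neutral fallback.
        return NEUTRAL_FALLBACK if looks_like_refusal(reply) else reply

    reply = generate(REFUSAL_PROMPT.format(question=question))
    # The model answered although no passages exist: force a standard refusal.
    return reply if looks_like_refusal(reply) else STANDARD_REFUSAL


# Usage with a stub in place of a real model call.
answer = respond("What is the capital of France?",
                 ["Paris is the capital of France."],
                 lambda prompt: "Paris.")
```

Deciding the route before generation means the answerable/unanswerable decision never depends on the model's own judgment; the post-processing step only handles replies that contradict the chosen route.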