Jerin Romijah Tuli

2026

MindFlayer at SemEval-2026 Task 13: LACR-ENS: Calibration-Aware Ensemble Routing for Cross-Language AI-Generated Code Detection
Jerin Romijah Tuli | Talukder Naemul Hasan Naem | Md. Sartaj Alam Pritom
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

This paper presents LACR-ENS, a calibration-aware ensemble system for detecting AI-generated code across eight programming languages (SemEval-2026 Task 13). We identify a severe asymmetric out-of-distribution (OOD) failure in fine-tuned code transformers: Expected Calibration Error (ECE) doubles from 0.09 (seen languages) to 0.18 (unseen languages), and high-confidence predictions (p ≥ 0.80) are wrong 39% of the time on OOD inputs. We propose Language-Aware Confidence Routing (LACR), formally equivalent to implicit per-language temperature scaling, which reduces OOD ECE to 0.11 and improves macro-F1 by +0.013 over fixed-threshold ensembling. A language-family proximity analysis reveals that syntactic distance to training languages predicts OOD F1 with Pearson r=+0.94, providing a principled, label-free signal for deployment risk assessment and motivating a continuous routing extension. Our system combines UniXCoder and GraphCodeBERT via weighted logit-level fusion and achieves macro-F1 0.538, outperforming comparable encoder-only systems. We additionally document a HuggingFace label inversion pitfall that suppressed our initial score by approximately 0.29 F1.
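For readers unfamiliar with the calibration metric quoted in the abstract, the following is a minimal sketch of the standard equal-width binned Expected Calibration Error. It is not the authors' evaluation code: the function name, the choice of 10 bins, and the toy inputs are assumptions made for illustration only.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count; the abstract does not state one
) -> float:
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # sum of confidences falling in each bin
    hits = [0] * n_bins        # number of correct predictions in each bin
    count = [0] * n_bins       # number of predictions in each bin

    for c, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hits[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        accuracy = hits[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


# Toy example (made-up numbers): four predictions at 0.9 confidence, three correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

In the toy example the single occupied bin has accuracy 0.75 against a mean confidence of 0.9, so the gap of about 0.15 is the ECE. Computing the metric separately on seen and unseen languages would give the kind of comparison the abstract reports.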


MindFlayer at SemEval-2026 Task 8: DUALRAG: Answerability-Aware Generation for Multi-Turn RAG Conversations
Jerin Romijah Tuli | Md. Sartaj Alam Pritom | Talukder Naemul Hasan Naem
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

Our system, DualRAG (team MindFlayer), tackles SemEval-2026 Task 8 Subtask B - generating faithful responses in multi-turn RAG conversations. The core idea is simple: before generating anything, we first check whether reference passages exist for the current question. If they do, we route through a domain-guided generation prompt that instructs the model to answer using only those passages. If they do not, we route through a strict refusal prompt that tells the model to politely decline rather than guess. We used Meta’s Llama-4-Scout-17B through the Groq API, with no training or fine-tuning - purely zero-shot prompting. A lightweight post-processing layer catches the rare cases where the model ignores its instructions: if it refuses when passages are available, we replace the response with a neutral fallback; if it answers when no passages exist, we replace it with a standard refusal. Out of 507 test tasks, only 7 needed this correction. The system ranked 8th out of 26 teams with a harmonic mean of 0.7492, beating the strongest baseline (GPT-OSS-120B at 0.639) by a notable margin. The standout result is 100% refusal accuracy on all 130 unanswerable questions - something even GPT-4o and Llama 3.1 405B failed to achieve consistently according to prior work. Our RLF score of 0.8782 shows the responses stay tightly grounded in the reference passages. The relatively lower RBagg (0.6024) reflects the challenge of matching human-written phrasing in a zero-shot setting, which we identify as the clearest direction for improvement.
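The routing and post-processing logic described in this abstract can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the prompt wording, the refusal markers, the `REFUSAL` and `FALLBACK` strings, and the `generate` callable (standing in for the LLM call) are all hypothetical, and the domain-specific guidance in the generation prompt is omitted.

```python
from typing import Callable, Sequence

# Hypothetical canned responses; the abstract does not give the actual wording.
REFUSAL = "I'm sorry, but I don't have the information needed to answer that."
FALLBACK = "Based on the available documents, I could not produce a complete answer."

# Hypothetical phrases used to detect that the model declined to answer.
REFUSAL_MARKERS = ("i cannot answer", "i can't answer", "unable to answer",
                   "don't have the information")


def looks_like_refusal(text: str) -> bool:
    """Crude keyword check for a refusal; a stand-in for the real detector."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def dual_route(
    question: str,
    passages: Sequence[str],
    generate: Callable[[str], str],  # assumed wrapper around the LLM API call
) -> str:
    """Route by passage availability, then correct instruction violations."""
    if passages:
        # Answerable branch: ask the model to use only the given passages.
        context = "\n\n".join(passages)
        prompt = (
            "Answer the question using only the passages below.\n\n"
            f"Passages:\n{context}\n\nQuestion: {question}"
        )
        reply = generate(prompt)
        # Post-processing: the model refused although passages were available.
        return FALLBACK if looks_like_refusal(reply) else reply

    # Unanswerable branch: ask the model to decline politely.
    prompt = (
        "No reference passages are available. Politely decline to answer "
        f"rather than guess.\n\nQuestion: {question}"
    )
    reply = generate(prompt)
    # Post-processing: the model answered although no passages exist.
    return reply if looks_like_refusal(reply) else REFUSAL
```

Passing the model call in as `generate` keeps the routing testable with a stub function, and it means the same logic could wrap any chat-completion client without changes.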
