Lexical and Computational Semantics and Semantic Evaluation (formerly Workshop on Sense Evaluation) (2026)
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
Ekaterina Kochmar | Debanjan Ghosh | Kai North | Mamoru Komachi
psy detectives at SemEval-2026 Task 10: PsyCoMark – Psycholinguistic Conspiracy Marker Extraction and Detection
Roxana Carabas | Anamaria Nacu | Lucian Isac | Daniela Gifu
We present our SemEval-2026 Task 10 (PsyCoMark) system that combines interpretable psycholinguistic signals with supervised neural modeling. Our approach includes (1) a marker-derived lexicon and LIWC-style ratio features built from span annotations, (2) binary Yes/No transformer baselines (RoBERTa and DeBERTa families) with optimized training configurations, and (3) a zero-shot TinyLlama-1.1B baseline for the full three-way setting (Yes/No/Can’t tell). Results show that marker-only features are transparent but weak, while transformer models provide substantially stronger performance; the best model, DeBERTa-v3-large, achieves 0.8339 weighted F1 on development and 0.75 weighted F1 on the competition test set. We also evaluate marker-driven heuristic relabeling of uncertain instances, which does not improve downstream performance. Overall, the submission provides a controlled, interpretable, and reproducible reference point for future work on integrating span-level psycholinguistic evidence with conspiracy detection.
wangkongqiang at SemEval-2026 Task 10: PsyCoMark – Psycholinguistic Conspiracy Marker Extraction and Detection
Wang Kongqiang | Tan Qingli
This paper presents our system for SemEval-2026 Task 10: PsyCoMark, Psycholinguistic Conspiracy Marker Extraction and Detection, covering Subtask 1 (Conspiracy Marker Extraction) and Subtask 2 (Conspiracy Detection). We focus on English and use four pre-trained language models: models–distilbert–distilbert-base-uncased, models–distilbert–distilbert-base-multilingual-cased, models–lxyuan–distilbert-base-multilingual-cased-sentiments-student, and models–microsoft–deberta-v3-base. We (1) analyze the training data visually, (2) use the gemma-3-27b-it generative model to augment the training data through prompts for Subtask 2, and (3) train multiple single models on the training data. We further study the influence of different hyperparameters on each single model and select the best single model to predict the test set. Our submission achieved a good ranking on the test-set leaderboard. Subtask 1 is evaluated by aggregating results over the four markers Actor, Action, Effect, and Victim, measured with the macro F1 score. Subtask 2 is essentially a binary text classification task; performance is evaluated using the macro-averaged F1 score, in other words using the weighted F1 score across different sentences and cultural contexts. Our best approach obtains a macro F1 score of 0.1587 on Subtask 1 and a weighted F1 score of 0.7411 on Subtask 2. For the final ranking, the organizers use the aggregate of the macro F1 and weighted F1 scores. Even so, our approach yielded good results.
NTNU-SMIL at SemEval-2026 Task 3: Logistic-Loss Regression with Same-Language Transfer for Valence–Arousal Stance Prediction in Dimensional Stance Analysis (DimStance)
Siang-Ting Lin | Tien-Hong Lo | Yun-Ting Sun | Jhih-Rong Guo | Tung-Yen Hao | Fong-Chun Tsai | Berlin Chen
We propose NTNU-SMIL’s system for SemEval-2026 Task 3 Track B Subtask 1 Dimensional Stance Analysis (DimStance). Our approach models target-conditioned valence–arousal regression using sentence-pair encoding, dual regression heads, and a logistic-loss regression formulation. For English and Chinese, we further leverage same-language transfer from Track A and apply lightweight out-of-fold calibration with multi-seed ensembling to reduce cross-lingual scale mismatch. Post-hoc analysis shows that same-language transfer and logistic-loss regression are the main drivers of performance gains, while arousal variance collapse remains a challenge in low-resource settings such as Swahili.
MindMiner at SemEval-2026 Task 10: Multi-Model Approaches to Conspiracy Detection and Psycholinguistic Marker Extraction
Pramod Kumar Ajmeera | Akshara Sri Lakshmipathy
Conspiracy narratives on social media often hide in subtle word cues and quiet reasoning patterns, making their detection a challenging task for natural language processing systems. SemEval-2026 Task 10 PsyCoMark introduces a benchmark for studying these phenomena, pairing binary conspiracy detection with the extraction of five key psycholinguistic markers: Actor, Action, Effect, Victim, and Evidence. In this paper, we examine how modern transformer-based models can grasp both the conspiratorial intent and the deeper reasoning structures behind such narratives, using rehydrated Reddit comments annotated by experts in psychology and linguistics. We test five models across these subtasks, emphasizing the gap that exists between classification and deeper discourse-level interpretation. Our best system reaches 0.80 weighted F1 on conspiracy detection and 0.16 macro F1 on marker extraction, with per-marker F1 ranging from 0.36 (Actor) to 0.00 (Victim). This work also contributes to the growing call for explainable NLP methods that integrate psycholinguistic insights to better illuminate misinformation and conspiratorial thinking online.
FER at SemEval-2026 Task 6: Analysis of Different Approaches to Unmasking Political Question Evasions
Matija Akrap | Andrija Bilić | Roko Šimpraga | Fran Račić | Luka Čuturilo
We tackle classifying evasive political answers within the context of SemEval-2026 Task 6 and compare three modeling strategies: a flat baseline, a hierarchical cascade, and a multitask learning approach. Our experiments demonstrate that a hierarchical RoBERTa-base model achieves the best performance, particularly by leveraging the distinctiveness of the class Clear Non-Reply. Conversely, we find that standard multitask learning frequently produces structurally invalid label combinations in a significant fraction of predictions. Our demonstrations show that applying a constrained inference mask eliminates these errors entirely while improving F1 performance, whereas a fully joint training approach underperforms due to data sparsity. Finally, we employ dataset cartography to compare training dynamics between the hierarchical and multitask approach.
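A constrained inference mask of the kind described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the label names, the coarse-to-fine mapping in VALID, and the additive scoring of the two heads are all assumptions made for the example.

```python
from itertools import product

# Hypothetical hierarchy: the fine-grained labels allowed under each coarse label.
VALID = {
    "reply": {"explicit"},
    "ambivalent": {"implicit", "partial"},
    "non_reply": {"declining", "ignorance"},
}

def constrained_decode(coarse_scores, fine_scores):
    """Return the best-scoring (coarse, fine) pair among valid combinations only."""
    best_pair, best_score = None, float("-inf")
    for coarse, fine in product(coarse_scores, fine_scores):
        if fine not in VALID[coarse]:
            continue  # masked out: structurally invalid combination
        score = coarse_scores[coarse] + fine_scores[fine]  # assumed: scores add
        if score > best_score:
            best_pair, best_score = (coarse, fine), score
    return best_pair

# Toy scores from two independent heads (made-up numbers).
coarse = {"reply": 2.0, "ambivalent": 1.5, "non_reply": 0.1}
fine = {"explicit": 0.2, "implicit": 0.1, "partial": 0.3,
        "declining": 3.0, "ignorance": 0.0}

# Decoding each head independently yields an invalid pair here.
independent = (max(coarse, key=coarse.get), max(fine, key=fine.get))
constrained = constrained_decode(coarse, fine)
```

In this toy case the independent argmax pairs "reply" with "declining", which the hierarchy forbids, while the masked joint search can only return a pair that the hierarchy allows.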
nchellwig at SemEval-2026 Task 3: Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis using Large Language Models
Nils Constantin Hellwig | Jakob Fehle | Udo Kruschwitz | Christian Wolff
We present Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis in SemEval-2026 Task 3 (Track A). SCSG enhances prediction reliability by executing a LoRA-adapted large language model multiple times per instance, retaining only tuples that achieve a majority consensus across runs. To mitigate the computational overhead of multiple forward passes, we leverage vLLM’s PagedAttention mechanism for efficient key–value cache reuse. Evaluation across 6 languages and 8 language–domain combinations demonstrates that self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with our system (leveraging Gemma 3) ranking in the top seven across all settings, achieving second place on three out of four English subsets and first place on Tatar-Restaurant for DimASTE.
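The majority-consensus step can be sketched in a few lines. This is an illustrative sketch only: the tuple layout (aspect, opinion, label) is a hypothetical simplification, and how the actual system aggregates continuous values across runs is not stated in the abstract.

```python
from collections import Counter

def majority_consensus(runs):
    """Keep only the tuples that appear in more than half of the runs."""
    counts = Counter()
    for run in runs:
        counts.update(set(run))  # count each tuple at most once per run
    threshold = len(runs) / 2
    return {t for t, c in counts.items() if c > threshold}

# Five toy runs over the same input (made-up predictions).
runs = [
    [("pizza", "delicious", "positive")],
    [("pizza", "delicious", "positive"), ("service", "slow", "negative")],
    [("pizza", "delicious", "positive"), ("service", "slow", "negative")],
    [("pizza", "tasty", "positive")],
    [("pizza", "delicious", "positive")],
]
kept = majority_consensus(runs)
```

Here only the tuple predicted in four of the five runs survives; the tuples seen in two runs and in one run are discarded.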
Team HausaNLP at SemEval-2026 Task 4: Narratives via Semantic Embeddings
Faisal Adam | Lukman Aliyu | Sani Aji
This paper presents Team HausaNLP’s submission to SemEval-2026 Task 4 (Track A), which requires identifying the more narratively similar of two candidate stories relative to an anchor. Narrative similarity is defined along three dimensions: abstract theme, course of action, and story outcomes. We conduct a systematic ablation comparing five approaches: a lexical TF-IDF baseline, two bi-encoder SBERT variants (all-MiniLM-L6-v2 and all-mpnet-base-v2), a paraphrase-focused embedding model, and a cross-encoder re-ranker. On the 200-instance development set, all-mpnet-base-v2 achieves the best performance (61.5% accuracy, 61.48 macro-F1), outperforming both TF-IDF (54.5%) and the official SBERT baseline (55.0%). Surprisingly, the cross-encoder re-ranker (55.5%) does not improve on the bi-encoders, which we attribute to the long-document nature of Wikipedia story summaries exceeding the model’s effective context window. On the official test set, our primary SBERT MiniLM submission achieved 61.50% accuracy (33rd of 44 teams). Our error analysis over 200 development instances identifies five systematic failure categories, distinct from the All Correct / Partial cases, including 23 Lexical Trap cases, 23 Hard Cases, and 24 Proposed-Recovery cases, thereby informing concrete directions for future work.
Team HausaNLP at SemEval-2026 Task 9: Tackling Class Imbalance in Low-Resource Hausa Polarization Detection
Faisal Adam | Sani Aji | Lukman Aliyu | Abdulhamid Abubakar
This paper describes our submission to SemEval-2026 Task 9, Subtask 2 (Hausa). The task involves identifying specific categories of polarization (Political, Religious, Ethnic, etc.) in Hausa social media comments. The dataset presented significant challenges, primarily extreme class imbalance and the low-resource nature of the language. Our system uses a pretrained multilingual transformer (Afro-XLMR-Large) fine-tuned with Weighted Binary Cross-Entropy loss and dynamic undersampling (1:3 ratio) to mitigate the scarcity of polarized examples. On the official test set, our system achieved an official Macro-F1 score of 0.2346 and a Micro-F1 score of 0.2581. Our model is recall-oriented (Micro-Recall: 0.6166), demonstrating strong capability in detecting polarization, though precision remains a challenge (0.1632). We achieved our best per-class performance in the Political domain (F1: 0.48).
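The two imbalance remedies named above can be sketched in plain Python. This is a sketch under stated assumptions, not the submitted system: it reads the 1:3 ratio as one positive to three negatives, treats "dynamic" as re-drawing the negatives with a new seed each epoch, and uses hypothetical function names and a made-up positive-class weight.

```python
import math
import random

def undersample(examples, labels, neg_per_pos=3, seed=0):
    """Keep every positive and at most neg_per_pos negatives per positive."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = rng.sample(neg, min(len(neg), neg_per_pos * len(pos)))
    keep = sorted(pos + keep_neg)
    return [examples[i] for i in keep], [labels[i] for i in keep]

def weighted_bce(prob, label, pos_weight):
    """Binary cross-entropy with the positive-class term scaled by pos_weight."""
    eps = 1e-12  # guards against log(0)
    return -(pos_weight * label * math.log(prob + eps)
             + (1 - label) * math.log(1 - prob + eps))

# Toy data: 5 polarized comments among 100.
labels = [1] * 5 + [0] * 95
texts = [f"comment {i}" for i in range(100)]

# Assumed meaning of "dynamic": pass a different seed at each epoch.
epoch_texts, epoch_labels = undersample(texts, labels, neg_per_pos=3, seed=1)

loss_on_positive = weighted_bce(0.5, 1, pos_weight=3.0)
loss_on_negative = weighted_bce(0.5, 0, pos_weight=3.0)
```

With these settings each epoch sees 5 positives and 15 negatives, and a mistake on a positive example costs three times as much as the same mistake on a negative one.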
LAFED at SemEval-2026 Task 13: Language-Agnostic Feature Engineering for Cross-Lingual AI-Generated Code Detection
Juan Villate Lemus
Robust detection of AI-generated source code across programming languages remains challenging due to language-specific cues and train–test distribution shifts. We present LAFED (Language-Agnostic Feature Engineering Detector), a feature-engineering approach trained on {Python, Java, C++} and evaluated on a multilingual test set that includes unseen languages {C, C#, Go, JavaScript, PHP}. LAFED combines (i) structural skeletal features (indentation, control-flow density, and approximations of McCabe/Halstead complexity), (ii) character and whitespace statistics inspired by stylometry, and (iii) micro-style patterns (operator spacing, blank lines, indentation consistency). Using XGBoost (Chen and Guestrin, 2016) with Optuna hyperparameter search (Akiba et al., 2019), our best model achieves macro-F1=0.7570 on a 1,000-sample test set; the official submission obtains macro-F1=0.75209 (5th place in Subtask A). Per-language analysis shows strong transfer to C# (0.7753) and JavaScript (0.7683), but weaker performance on Go (0.6400) and PHP (0.5238).
ModusPonens at SemEval-2026 Task 11: Breaking the Plausibility Trap in LLMs via Conflict-Aware Ensembling
Soumyajit Roy | Manav Malhotra
Large Language Models (LLMs) often struggle to disentangle formal logical validity from real-world plausibility, a phenomenon known as the "belief bias". This paper describes our submission to SemEval-2026 Task 11. We frame the task as a calibration problem between "System 1" (heuristic) and "System 2" (logical) thinking. Our experiments reveal that standard neuro-symbolic interventions, such as Structural Chain-of-Thought (CoT) and Nonsense Augmentation, degrade performance in low-resource regimes due to an "abstraction penalty". Instead, we propose a Conflict-Aware Logit Ensemble. We fine-tune two variations of Qwen-2.5-14B: a standard "Believer" model and a bias-hardened "Skeptic" model trained on oversampled conflict data. By ensembling their logits via soft-voting, we achieve a Pareto-optimal balance, reducing the Total Content Effect (TCE) to 3.21 while maintaining an overall accuracy of 94.27%, resulting in a Combined Score of 39.09.
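Soft voting over two classifiers can be sketched as below. The abstract does not say whether raw logits or probabilities are averaged, so this sketch assumes averaged softmax probabilities with equal weights; the class ordering and the logit values are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_vote(logits_a, logits_b, weight_a=0.5):
    """Average the two models' class probabilities; return (label, probs)."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    probs = [weight_a * a + (1 - weight_a) * b for a, b in zip(pa, pb)]
    return probs.index(max(probs)), probs

# Hypothetical ordering: class 0 = invalid, class 1 = valid.
believer_logits = [0.2, 1.0]  # mildly prefers "valid"
skeptic_logits = [2.5, 0.0]   # strongly prefers "invalid"
label, probs = soft_vote(believer_logits, skeptic_logits)
```

Because the second model is far more confident than the first, the averaged distribution follows it and the ensemble predicts class 0; a hard majority vote between two disagreeing models would have no such tie-breaker.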
QuadAI at SemEval-2026 Task 3: Ensemble Learning of Hybrid RoBERTa and LLMs for Dimensional Aspect-Based Sentiment Analysis
A.j.w. De Vink | Filippos Karolos Ventirozos | Natalia Amat-Lefort | Lifeng Han
We present our system for SemEval-2026 Task 3 on dimensional aspect-based sentiment regression. Our approach combines a hybrid RoBERTa encoder, which jointly predicts sentiment using regression and discretized classification heads, with large language models (LLMs) via prediction-level ensemble learning. The hybrid encoder improves prediction stability by combining continuous and discretized sentiment representations. We further explore in-context learning with LLMs and ridge-regression stacking to combine encoder and LLM predictions. Experimental results on the development set show that ensemble learning significantly improves performance over individual models, achieving substantial reductions in RMSE and improvements in correlation scores. Our findings demonstrate the complementary strengths of encoder-based and LLM-based approaches for dimensional sentiment analysis. Our development code and resources will be shared at https://github.com/aaronlifenghan/ABSentiment
wangkongqiang at SemEval-2026 Task 1: MWAHAHA – Competition on Humor Generation
Wang Kongqiang | Zhang Peng | Tan Qingli
This paper presents our system for SemEval-2026 Task 1: MWAHAHA, Competition on Humor Generation. Subtask A (Text-based Humor Generation) asks systems to generate a joke given a set of text-based constraints and is conducted in English, Spanish, and Chinese. Subtask B (Image-Based Caption Generation) explores humor in a multimodal context, combining visual inputs with text generation, and is in English only; it comprises Task B1 (Image-only Humor Generation) and Task B2 (Image and Prompt Humor Generation). We focus on Subtask A in English and Chinese and on Subtask B in English, using two families of models: BLIP and the Qwen series of LLMs. Our submission achieved a good ranking on the test set. All subtasks are evaluated using a Rating score with a 95% CI across languages and modalities. For Subtask A, we obtain Rating scores of 950 in English and 1054 in Chinese, with 95% CIs of [922, 982] and [1024, 1104], ranking 16th and 1st respectively. For Subtask B, we obtain Rating scores of 976 on B1 and 987 on B2, with 95% CIs of [941, 1007] and [948, 1016], ranking 5th and 3rd respectively. For the final ranking, the organizers use the Rating (95% CI) score. Even so, our approach yielded good results.
JCT at SemEval-2026 Task 1: Let the Best Joke Win - A Generate-and-Rank Approach to Constrained Humor
Batya Schechter | Sarah Barzel | Chaya Liebeskind
We present a humor generation system for SemEval-2026 Task 1, Subtask A (Castro et al., 2026) that produces short jokes under lexical or headline-based constraints. For each input, our system generates multiple candidate jokes using a large language model across diverse humor styles and prompting strategies, including zero-shot, few-shot, and structured prompting. Constraint satisfaction is explicitly enforced, either by requiring exact lexical inclusion or by approximating semantic relevance to a headline using sentence-embedding similarity. All valid candidates are ranked using a weighted humor score that combines semantic incongruity, emotion-based humor potential, irony likelihood, linguistic fluency, and novelty with respect to a large external jokes corpus, and the single highest-scoring joke is selected for each constraint. This approach follows a best-candidate selection paradigm, leveraging automated humor proxies to improve joke quality without task-specific fine-tuning.
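The filter-then-rank selection described above can be sketched as follows. The abstract names the five score components but not their weights or how each is computed, so the weights, feature values, and function names here are hypothetical, and the lexical constraint is reduced to a case-insensitive substring check.

```python
# Hypothetical weights over the five components named in the abstract.
WEIGHTS = {
    "incongruity": 0.3,
    "emotion": 0.2,
    "irony": 0.2,
    "fluency": 0.2,
    "novelty": 0.1,
}

def humor_score(features):
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)

def select_best(candidates, required_word):
    """Drop candidates that miss the required word, then return the top scorer."""
    valid = [c for c in candidates
             if required_word.lower() in c["text"].lower()]
    if not valid:
        return None
    return max(valid, key=lambda c: humor_score(c["features"]))

# Toy candidates with made-up component scores.
candidates = [
    {"text": "Why did the penguin cross the road? To chill on the other side.",
     "features": {"incongruity": 0.6, "emotion": 0.5, "irony": 0.2,
                  "fluency": 0.9, "novelty": 0.4}},
    {"text": "My penguin filed taxes; the refund was in fish.",
     "features": {"incongruity": 0.9, "emotion": 0.6, "irony": 0.5,
                  "fluency": 0.8, "novelty": 0.8}},
    {"text": "A seal walks into a bar.",
     "features": {"incongruity": 1.0, "emotion": 1.0, "irony": 1.0,
                  "fluency": 1.0, "novelty": 1.0}},
]
best = select_best(candidates, "penguin")
```

The third candidate has the highest raw score but is removed by the constraint check before ranking, which is the point of enforcing constraints ahead of scoring.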
zhangpeng at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Zhang Peng | Lu Gehao
This paper presents our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization, covering Subtask 1 (Polarization Detection), Subtask 2 (Polarization Type Classification), and Subtask 3 (Manifestation Identification), each a multilingual text classification challenge. For Subtask 1, we explored classical text representation approaches including Bag-of-Words, Word2Vec Average Vectors, and Bag-of-Centroids. Among these methods, the Bag-of-Centroids model achieved the best performance on both the development and test datasets. For Subtask 2 and Subtask 3, we fine-tuned four pre-trained language models: google-bert, FacebookAI-roberta, dccuchile-bert, and distilbert-multi. We (1) analyze the training data visually, (2) train multiple single models on the training data, and (3) combine multiple single models through weighted-voting ensemble learning. We further study the influence of different hyperparameters on the ensemble and select the best ensemble to predict the test set. On the official test set, our system achieved Macro-F1 scores of 0.6882 (EN) and 0.6711 (SP) for Subtask 1, 0.3752 (EN) and 0.6386 (SP) for Subtask 2, and 0.3561 (EN) and 0.4366 (SP) for Subtask 3. For the final ranking, the organizers use the Macro F1 score. These approaches yielded good results.
lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation
Alexey Tikhonov | Alexey Ivanov
Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because "funny" is audience-dependent and supervision is noisy: preferences vary with audience, context, and culture, and annotator agreement is often low. In this paper, we describe our system for the SemEval-2026 Task 1 (MWAHAHA), which focuses on humor generation under explicit constraints. The task evaluates submitted systems via human preference judgments in 1-on-1 arena-style comparisons. We adopt a "generate-many - select-best" strategy. First, we generate a diverse pool of candidates per instance using multi-step prompting, model ensembling, and diversity-oriented decoding. Second, we select outputs using a preference model that approximates a "reader" by learning from human comparisons rather than absolute funniness scores. To support this approach, we release 2.5K human pairwise judgments collected through the Humor Arena prototype. We further propose an interpretable pipeline that converts labeled comparisons into a preference model. Across three preference datasets, our models consistently outperform baselines and show stronger cross-domain transfer. Finally, we apply the learned preference model to rank candidates for the MWAHAHA setting and release intermediate artifacts (candidate pools and rankings) to facilitate follow-up work. Our system ranked 1st in the English and Chinese subtasks of MWAHAHA and 2nd in the Spanish subtask.
kevinyu66 at SemEval-2026 Task 3: A Retrieval-Augmented LLM System for Aspect–Opinion Triplet Extraction
Kuanlin Yu | Wen-Ni Liu
This paper describes our system used in the SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis. To address the inherent subjectivity and nuanced emotional expressions in this task, we propose a Retrieval-Augmented Generation (RAG) framework based on Large Language Models (LLMs) for sentiment triplet extraction. Our approach leverages a dynamic retrieval mechanism to identify semantically similar training examples, which are then integrated into the prompts as in-context demonstrations. This strategy effectively guides the model’s inference process by providing relevant linguistic patterns and emotional contexts. Our implementation is available at https://github.com/Kevinyu66/dimaste.
Lakksh at SemEval-2026 Task 11(1 2): Neuro-Symbolic Decomposition to Mitigate Content Bias in Syllogistic Reasoning
Lakksh Sharma | Krish Sharma | Jatin Bedi
Syllogistic reasoning is the ability to distinguish logical validity from semantic plausibility — a setting in which LLMs succumb to frequent content bias by conflating the two. The result is a characteristic failure to recognize logically valid arguments with highly implausible conclusions and logically invalid but semantically plausible arguments. This paper introduces a neuro-symbolic system that avoids this behavior by design: neural structure extraction is strictly separated from symbolic validity checking. A T5-Small parser is trained only on synthetic nonsense-symbol syllogisms, ensuring that the structural parse is learned in the absence of real-world semantics. Validity checking is performed by a deterministic symbolic kernel operating on extracted logical form alone, ensuring that semantic content cannot influence the final call. In binary validity classification, the system achieves 97.38% accuracy with a Total Content Effect of 3.10; in the retrieval setting, it achieves 82.11% accuracy with 99.47% F1 on premise identification. Ablation experiments show that formal theorem proving via NL-to-Z3 translation actually increases content bias due to leakage in intermediate representations. The results recommend architectural separation as a promising content-robustness strategy for syllogistic reasoning.
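A deterministic validity check that looks only at logical form can be sketched by brute-force model enumeration. This is not the paper's symbolic kernel: it is a generic checker for the four categorical forms (A, E, I, O) over three terms, and it assumes the modern reading without existential import, which may differ from the task's convention.

```python
from itertools import product

TERMS = ("S", "M", "P")
# Each Venn region is a triple of memberships in (S, M, P); there are 8 regions.
REGIONS = list(product([False, True], repeat=3))

def holds(statement, occupied):
    """Evaluate (form, X, Y) in a model given its set of non-empty regions."""
    form, x, y = statement
    ix, iy = TERMS.index(x), TERMS.index(y)
    if form == "A":  # All X are Y
        return all(r[iy] for r in occupied if r[ix])
    if form == "E":  # No X are Y
        return not any(r[ix] and r[iy] for r in occupied)
    if form == "I":  # Some X are Y
        return any(r[ix] and r[iy] for r in occupied)
    if form == "O":  # Some X are not Y
        return any(r[ix] and not r[iy] for r in occupied)
    raise ValueError(f"unknown form: {form}")

def is_valid(premises, conclusion):
    """Valid iff the conclusion holds in every model that satisfies the premises."""
    for mask in product([False, True], repeat=len(REGIONS)):
        occupied = {r for r, on in zip(REGIONS, mask) if on}
        if all(holds(p, occupied) for p in premises) and not holds(conclusion, occupied):
            return False  # counter-model found
    return True

# Barbara: All M are P, All S are M, therefore All S are P.
barbara = is_valid([("A", "M", "P"), ("A", "S", "M")], ("A", "S", "P"))
# Undistributed middle: All P are M, All S are M, therefore All S are P.
undistributed = is_valid([("A", "P", "M"), ("A", "S", "M")], ("A", "S", "P"))
```

Since the checker sees only forms and term symbols, replacing the terms with plausible or implausible real-world nouns cannot change its verdict; the search covers all 256 ways of marking the eight regions empty or non-empty.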
CuriosAI at SemEval-2026 Task 2: Predicting Emotion using RoBERTa-large model
Fumika Beppu | Hiroki Takushima | Aiswariya Manoj | Daichi Yamaga | Yuki Shibata | Takayuki Hori
This paper proposes a method for predicting continuous emotion dimensions, namely Valence and Arousal, from text by combining affective intermediate training with multi-task learning. The proposed approach consists of two training phases: an intermediate pre-training phase using external emotion datasets, followed by a multi-task learning phase using task-specific data. RoBERTa-large is employed as the backbone model, and independent regression heads are introduced for each subtask. Experimental results show that the proposed method achieves Pearson correlation coefficients of 0.68 for Valence and 0.45 for Arousal on Subtask 1, demonstrating stable performance, particularly in capturing inter-user differences in emotional expression.
UIT-Polar at SemEval-2026 Task 9 Detecting Multilingual, Multicultural and Multievent Online Polarization
Hoàn Trần
We present a two-stage hybrid system for SemEval-2026 Task 9 on multilingual and multievent online polarization detection. The first stage employs DeBERTa for high-recall binary filtering to mitigate severe class imbalance. The second stage leverages Mistral for fine-grained polarization classification, enabling improved semantic reasoning over candidate instances. This coarse-to-fine design enhances robustness and efficiency while preserving minority-class performance. Our system achieves Top-5 results on the English test set, demonstrating the effectiveness of integrating encoder-based screening with LLM-based refinement.
wangkongqiang at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Wang Kongqiang | Tan Qingli
This paper presents our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization, covering Subtask 1 (Polarization Detection), Subtask 2 (Polarization Type Classification), and Subtask 3 (Manifestation Identification), each a multilingual text classification challenge. We focus on English and Spanish and use two pre-trained language models: models–google-bert–bert-base-uncased and models–microsoft–deberta-v3-base. We (1) analyze the training data visually, (2) use the gemma-3-27b-it generative model to augment the training data through prompts, and (3) train multiple single models on the training data. We further study the influence of different hyperparameters on each single model and select the best single model to predict the test set. Our submission achieved a good ranking on the test set. All subtasks are evaluated using the Macro F1 score across different languages and cultural contexts. For Subtask 1, our Macro F1 scores are 0.7805 for English and 0.7155 for Spanish. For Subtask 2, they are 0.2603 for English and 0.4647 for Spanish. For Subtask 3, they are 0.2766 for English and 0.3322 for Spanish. For the final ranking, the organizers use the Macro F1 score. Even so, our approach yielded good results overall.
“AGI” Team at SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time from Ecological Essays
Harsh Rathva
This paper describes our submission to SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal. We combine RoBERTa-Large text encoding with a unidirectional GRU for temporal modeling and gated user embeddings for personalization. A four-phase staged training curriculum employs ordinal regression for absolute affect prediction and a zero-inflated delta model for change detection. Our approach achieves competitive performance on Subtask 1 (longitudinal affect assessment) with composite correlation r=0.600 for valence and r=0.452 for arousal. However, we observe systematic degradation in Subtask 2A (state change detection) with negative correlations (r=-0.167 for valence, r=-0.147 for arousal), revealing a fundamental trade-off between stability-oriented representations and change sensitivity. We provide detailed empirical analysis of these failure modes, contributing insights into the challenges of modeling emotional dynamics in ecological data. Code and trained checkpoints are publicly available.
Narrative Team at SemEval-2026 Task 4: Two-Stage Contrastive Learning for Narrative Similarity Assessment
Tatiana Khaidukova | Ana Ciobanu | Daniela Gifu | Diana Trandabat
For SemEval-2026 Task 4, we introduce a unified two-stage framework based on a RoBERTa-large encoder. Stage 1 performs contrastive pre-training on synthetic triplets to learn general narrative similarity patterns. Stage 2 fine-tunes the model with a ranking-based objective tailored to Track A. The resulting encoder supports both binary similarity classification (Track A) and narrative embedding generation (Track B) without architectural changes. Our system achieves an accuracy of 0.64 on Track A and 0.69 on Track B, outperforming single-stage baselines and demonstrating that combining synthetic contrastive supervision with task-specific ranking yields stable and reusable narrative representations.
CYUT at SemEval-2026 Task 3: Multi-Task Dimensional Aspect Sentiment Regression with Polar Multi-Zone Labeling in VA Space
Shih-Hung Wu | Xian-Yan Chen | Yi-Min Jian
This paper describes CYUT’s system for SemEval-2026 Task 3 Track B, a multilingual aspect-based dimensional sentiment regression task. We formulate the task as continuous Valence–Arousal (VA) prediction and adopt a multi-task learning (MTL) framework with auxiliary tasks automatically derived from gold VA annotations, including polarity, intensity, and quadrant classification. However, these coarse-grained labels may still suffer from regional imbalance in the VA space, leaving some regions with insufficient auxiliary supervision. To address this issue, we extend the system with Polar Multi-Zone Labeling (PMZL) and use its seven-zone variant, PMZL-7. PMZL-7 partitions the VA plane into one core neutral region and six non-central zones based on the directional distribution of non-central samples. This design reduces the risk of auxiliary-label imbalance while supplementing directional information that conventional auxiliary tasks cannot directly capture. We evaluate XLM-R and two generative pretrained models. Results show that PMZL-7 is strongly model-dependent: it provides more stable improvements for Qwen2 and Ministral, while its effect on XLM-R is less consistent. On the official test set, our system achieves the best performance on the NigerianPidgin subset among all participating systems.
CSIRO-LT at SemEval-2026 Task 2: In-the-Wild Valence and Arousal Forecasting on Ecological Text Time Series
Jiyu Chen | Necva Bölücü | Sarvnaz Karimi | Diego Molla | Cecile Paris
Predicting emotional valence and arousal in text is challenging due to the continuous, dynamic, and context-dependent nature of emotions. The SemEval 2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time from Ecological Essays shared task investigates longitudinal affect prediction from real-world personal essays, including forecasting short-term state and longer-term dispositional changes. We compare Pre-trained Language Models (PLMs) and Large Language Models (LLMs) for these subtasks, examining different input representations and feature formulations. We show that sentiment-aware PLMs are most effective for continuous valence and arousal prediction, and LLMs are effective for short-term state forecasting. Modelling dispositional changes remains challenging, and none of our neural approaches surpasses a simple historical baseline in this setting.
CITD@UIT at SemEval-2026 Task 2: Temporal Mixture-of-Experts for Longitudinal Valence and Arousal Prediction from Ecological Essays
Son Phuong | My Ngo | Tri Minh Dao | Duc-Vu Nguyen
This paper describes our participation in SemEval-2026 Task 2, which focuses on the longitudinal assessment and forecasting of emotional states through text. The challenge is divided into two primary objectives: Subtask 1, which requires estimating continuous Valence and Arousal (V&A) scores for a sequence of texts, and Subtask 2, which focuses on forecasting future emotional variations, specifically State Change (2A) and Dispositional Change (2B). To address these tasks, we propose a unified framework based on cardiffnlp/twitter-roberta-base-sentiment-latest, a transformer architecture pretrained on 124 million tweets. For all subtasks, we sort the data chronologically by userid and use a sliding window approach to capture longitudinal context. We conduct extensive experiments combining this pretrained RoBERTa model with Multilayer Perceptron (MLP) and Mixture-of-Experts (MoE) architectures to optimize performance. Furthermore, we utilize both attention pooling and mean pooling on all output hidden state representations to extract richer semantic features. Our proposed system demonstrated competitive performance, officially ranking 9th in Subtask 1 and 5th in Subtask 2A among participating teams.
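The chronological sort and sliding window can be sketched as below. The record layout (user_id, timestamp, text), the window size, and the function name are assumptions for illustration; the abstract does not give the window length or how context is fed to the encoder.

```python
from collections import defaultdict

def sliding_windows(records, window=3):
    """Pair each text with up to window - 1 earlier texts from the same user.

    Each record is a (user_id, timestamp, text) tuple.
    Returns a list of (user_id, context_texts, current_text) samples.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, text in records:
        by_user[user_id].append((timestamp, text))

    samples = []
    for user_id, items in by_user.items():
        items.sort()  # chronological order within each user
        texts = [text for _, text in items]
        for i, text in enumerate(texts):
            context = texts[max(0, i - window + 1):i]
            samples.append((user_id, context, text))
    return samples

# Toy records, deliberately out of order.
records = [
    ("u1", 2, "b"),
    ("u2", 1, "x"),
    ("u1", 1, "a"),
    ("u1", 3, "c"),
    ("u1", 4, "d"),
]
samples = sliding_windows(records, window=3)
```

Grouping by user before windowing keeps one user's essays from leaking into another user's context, and the earliest texts simply receive a shorter context.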
Hidetsune at SemEval-2026 Task 10: A Systematic Exploration of Training and Inference Strategies for Detecting Conspiracy Beliefs
Hidetsune Takahashi
This paper describes a system developed for SemEval-2026 Task 10 Subtask 2, which focuses on identifying conspiracy beliefs expressed in Reddit comments. The study begins with a comparative analysis of language models fine-tuned on the task data. In addition to fine-tuning, multiple auxiliary techniques were examined, including instruction-based prompting, data augmentation via back-translation, and loss function methods designed to address label imbalance. In the final stage, the inference behavior was further examined by varying the decision threshold applied to the softmax output probabilities. The results highlight how choices made during model selection, training, and inference collectively affect performance, offering empirical insights into the challenges of conspiracy belief detection in social media contexts.
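Varying the decision threshold on softmax probabilities, as mentioned in this abstract, can be sketched as follows. This assumes a binary classifier whose positive class sits at index 1; the function names and the default threshold are illustrative and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_positive(logits, threshold=0.5):
    """Return 1 if the positive-class probability (index 1, an assumption)
    reaches the threshold, else 0."""
    return int(softmax(logits)[1] >= threshold)
```

Lowering the threshold trades precision for recall on the positive class, which is one common way to compensate for label imbalance at inference time.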
OZemi at SemEval-2026 Task 9: A Cross-Lingual Approach to Online Text Polarization Classification Using Multilingual Models and Adaptive Loss Formulation
Hidetsune Takahashi | Eleale Nusi Tee | Aika Yu | Ruri Furukawa | Sooeun Kim | Shuta Niinomi | Dingyu Zhang | Emily Ohman
This paper presents the OZemi team’s submission to SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization. We propose a unified multilingual approach that addresses multiple languages and subtasks efficiently. Our system combines multilingual models with data-level techniques and a class-weighted cross-entropy loss to mitigate data imbalance across languages, subtasks, and categories. Results show consistent performance across languages, with macro F1 scores above 70% in most languages for Subtask 1; our highest rank in Subtask 1 was for Persian (1st out of 44). These results suggest that the proposed framework provides a flexible foundation for multilingual and multi-task polarization analysis.
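A class-weighted cross-entropy loss of the kind named in this abstract might look like the sketch below. The inverse-frequency weighting scheme and both function names are assumptions made for illustration; the abstract does not state how the weights were derived.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rare classes weigh more.
    This particular scheme is an assumption, not the paper's stated choice."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the sum of weights."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)
```

With these weights, a misclassified minority-class example contributes more to the loss than a majority-class one, which counteracts the imbalance during training.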
Hidetsune at SemEval-2026 Task 11: Adapting Pretrained Reasoning Models with Deep Supervision and Inference Refinement for Content-Independent Validity Classification
Hidetsune Takahashi
This paper presents a system that applies training and inference approaches for SemEval-2026 Task 11 Subtask 1, which focuses on binary classification for content-independent validity reasoning in syllogistic inference. Building on fine-tuning of relatively standard language models, additional approaches were explored, including layer-wise deep supervision and in-context learning. Furthermore, models that had been previously trained on datasets related to logical reasoning were adapted to the task through additional fine-tuning. Finally, refinement was performed at the inference stage by adjusting the softmax-based decision threshold of the selected model. The experimental results illustrate how model selection, training strategies, and threshold adjustment affect not only validity accuracy but also robustness against plausibility-driven bias, thereby contributing to improved logical integrity.
cclin at SemEval-2026 Task 2: SLM-Enhanced Lightweight Multi-BERT Ensemble for Longitudinal Affect Assessment
Jing-Jun Lin
This paper describes the system developed by our team for SemEval-2026 Task 2, Subtask 1: Longitudinal Affect Assessment. Our goal is to predict Valence and Arousal from ecological essays and feeling words over time. We propose an efficient hybrid framework that uses quantized 7B-scale language models as deterministic meta-feature extractors and combines them with an ensemble of DeBERTa, RoBERTa, and DistilBERT encoders. The system is designed to run on a single consumer-grade RTX 5060 Ti (16GB) GPU while remaining competitive. To bridge discrete supervision and continuous evaluation, we train the model as an ordinal classification problem and decode class probabilities into continuous scores through expected-value decoding. Our best system achieved an overall V&A average of 0.587, with per-dimension composite correlations of 0.647 for Valence and 0.527 for Arousal, ranking 3rd out of 31 teams. The results show that lightweight SLM-derived priors and multi-encoder fusion provide a strong performance–efficiency trade-off, especially for Arousal, where contextual anchoring is crucial.
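Expected-value decoding, as named in this abstract, turns a distribution over ordinal classes into a single continuous score. The sketch below assumes each class has a known numeric value; the example values are hypothetical and not taken from the paper.

```python
def expected_value_decode(class_probs, class_values):
    """Return the probability-weighted mean of the class values."""
    return sum(p * v for p, v in zip(class_probs, class_values))


# Hypothetical five-level scale with probability mass split over two levels.
score = expected_value_decode([0.0, 0.5, 0.5, 0.0, 0.0], [-2, -1, 0, 1, 2])
```

Unlike taking the argmax class, this decoding can produce values between the discrete levels, which suits correlation-based evaluation of continuous scores.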
YNU-HPCC at SemEval-2026 Task 12: Retrieval-Guided Reasoning with Teacher Distillation for Abductive Event Reasoning
Yuwei Sun | Jin Wang | Xuejie Zhang
This paper describes the YNU-HPCC system for SemEval-2026 Task 12, Abductive Event Reasoning (AER). Given multi-document retrieved evidence with distractors, the task requires selecting all direct-cause options for a target event and outputting an answer set. The main challenges are sparse and dispersed evidence in long documents and a boundary-sensitive set-level evaluation. This paper proposes a two-stage framework. Stage 1 trains a DeBERTa-v3-base student with retrieval-guided evidence modeling: documents are split into overlapping windows, BM25 ranks and filters candidate windows, and Top-K pooling aggregates window-level scores into option probabilities. Stage 2 distills soft targets from a Qwen-14B teacher with temperature scaling and high-confidence filtering to reduce pseudo-label noise and improve generalization. The system achieves an official dev score of 0.9712 (micro-F1 0.9746, macro-F1 0.9745) and improves the test score from 0.46 to 0.73, ranking 84th out of 221 submissions.
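Two of the Stage 1 steps in this abstract, splitting documents into overlapping windows and Top-K pooling of window-level scores, can be sketched as follows. The window size, stride, mean aggregation over the top scores, and function names are assumptions for illustration; the BM25 ranking step is omitted.

```python
def overlapping_windows(tokens, size, stride):
    """Split a token list into windows of `size` tokens taken every `stride`
    tokens, adding a final window so the tail of the document is covered."""
    if len(tokens) <= size:
        return [tokens]
    starts = list(range(0, len(tokens) - size + 1, stride))
    if starts[-1] + size < len(tokens):
        starts.append(len(tokens) - size)
    return [tokens[s:s + size] for s in starts]


def top_k_pool(scores, k):
    """Average the k highest window scores (expects a non-empty list)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)
```

Pooling only the best-scoring windows keeps a few strong pieces of evidence from being diluted by the many irrelevant windows in a long document.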
Emo-tica at SemEval-2026 Task 2: Trait–State Affect Forecaster for Longitudinal Valence and Arousal
Sadia Noor | Mehwish Fatima
Modeling longitudinal affect requires capturing both stable user tendencies and transient textual signals. For SemEval-2026 Task 2, we propose the Trait-State Affect Forecaster (TSAF), which decomposes affect into persistent user traits and text-conditioned states integrated through adaptive gating. On per-text prediction (Subtask 1), TSAF achieves composite Pearson correlations of 0.645 for valence and 0.409 for arousal, outperforming the Linear(BERT) baseline. In forecasting tasks, results reveal strong short-term affective inertia, where prior affect dominates next-step prediction, while long-term drift remains challenging under sparse supervision; TSAF shows comparatively stronger gains for arousal in this setting. Analyses across user splits and modalities highlight the strengths and trade-offs of explicit trait-state modeling, particularly under cold-start and short-text conditions.
Sifei at SemEval-2026 Task 8: Hybrid Retrieval and Query Rewriting for Multi-Turn RAG
Sifei Meng | Dmitry Ilvovsky
Multi-turn retrieval-augmented generation (RAG) is challenging due to evolving user intent, conversational noise, and strict context limits. We propose a training-free hybrid retrieval pipeline for SemEval-2026 Task 8 that combines dense and sparse retrieval with controlled query rewriting and cross-encoder reranking. Our system achieves 0.5453 nDCG@5 on the official test set of Task A, ranking 3rd out of 38 teams and outperforming the strongest baseline (0.4795). For Task C, we reuse the Task A retrieved documents in a lightweight generation pipeline based on the official prompt, achieving 0.5312 (harmonic mean of quality and faithfulness) and ranking 15th out of 29 teams. All retrieval components are open-source, while rewriting and generation use LLM APIs. Code and scripts are available on GitHub (https://github.com/mengsifei/MultiturnRAG).
MarSan at SemEval-2026 Task 4: Narrative Similarity via Sentence-BERT Metric Learning with Triple-Derived Losses
Maryam Najafi | Ehsan Tavan | Simon Colreavy
We describe our submission to SemEval-2026 Task 4 on Narrative Story Similarity and Narrative Representation Learning (NSNRL). The shared task defines narrative similarity through comparative judgments over triples consisting of an anchor story and two candidates: systems must determine which candidate is narratively closer (Track A) and must output story embeddings whose cosine distances reproduce the same ordering under withheld evaluation triples (Track B). We implement a unified representation-learning approach based on a Sentence-BERT bi-encoder trained with triple-derived metric learning objectives, combining in-batch contrastive learning with explicit triplet and margin-ranking constraints. Track A is solved by direct cosine comparison between the anchor embedding and each candidate embedding, while Track B outputs normalized story vectors from the same encoder without any additional test-time modelling. During evaluation, we achieve 65.00% accuracy on Track A and 65.50% on Track B. These results suggest that a single, well-aligned bi-encoder can perform competitively across both tracks while remaining computationally efficient.
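The Track A decision rule described in this abstract, a direct cosine comparison between the anchor embedding and each candidate embedding, can be sketched as below. The embeddings are plain lists of floats here, and the function names and tie-breaking toward candidate A are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def closer_candidate(anchor, cand_a, cand_b):
    """Return "A" or "B", whichever candidate is more similar to the anchor
    (ties go to "A", an arbitrary choice)."""
    return "A" if cosine(anchor, cand_a) >= cosine(anchor, cand_b) else "B"
```

Because the same embeddings serve both tracks, any encoder improvement measured on Track A triples carries over directly to the Track B ordering.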
HU at SemEval-2026 Task 6: A Hybrid Discriminative Modeling of Political Clarity and Evasion
Taha Munawar | Basil Khan | Arsal Jangda | Sarfaraz Baig | Sandesh Kumar | Abdul Samad
We describe our submission to SemEval-2026 Task 6: CLARITY, which aims to classify political question–answer pairs by response clarity and evasive technique. We investigate several approaches, including long-context transformers, multiple instance learning, hierarchical multi-task models, and a natural language inference (NLI) formulation. On the development set, our best-performing NLI model achieves a macro-F1 of 0.79 for Subtask 1, while our best attention-based MIL model achieves a macro-F1 of 0.43 for Subtask 2. On the hidden evaluation set, our official submission obtains macro-F1 scores of 0.81 for Subtask 1 and 0.45 for Subtask 2. Our findings demonstrate the benefits of entailment-based modeling for clarity prediction and localized reasoning for evasion detection under limited computational resources.
Team faisalm3 at SemEval-2026 Task 3: From Standard Regression to Distributional Alignment in Dimensional Sentiment Analysis
Faisal Adam | Lukman Aliyu | Sani Aji | Abdulhamid Abubakar | Aliyu Rabiu Shuaibu
This paper describes our participation in SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA) (Yu et al., 2026). We utilized a pre-trained DeBERTa-V3 backbone to capture semantic meaning through disentangled attention. While standard Mean Squared Error (MSE) loss establishes a performance floor, we propose a Hybrid MSE-CCC Loss to identify distributional relationships that simple regression missed. Our results demonstrate a 54.6% reduction in validation loss compared to the baseline, significantly improving detection in high-intensity emotional bins by mitigating the "regression to the mean" phenomenon.
wangkongqiang at SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures
Wang Kongqiang | Zhang Peng | Tan Qingli
This paper presents our system for SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures, covering Subtask 1: Short Answer Questions (SAQ) and Subtask 2: Multiple-Choice Questions (MCQ). We focus on models’ cultural competence across 26 languages and 30 countries using four large language models (LLMs): deepseek-v3.2-exp, qwen-max, qwen-plus, and qwen3-next-80ba3b-instruct. We (1) analyze the trial and test datasets visually, (2) prompt the generative models to generate or select the answer they deem correct on the trial and test datasets, and (3) evaluate several prompt engineering approaches on the trial dataset. We further study the influence of different hyperparameters on the generative model and select the best single model for prediction on the test dataset. Our submission achieved a good ranking on the test dataset leaderboard. Subtask 1 (SAQ) is evaluated on the aggregate results over 23 languages (ar-EG, ar-MA, ar-SA, bg-BG, el-GR, en-AU, and so on), measured by accuracy. Subtask 2 (MCQ) is essentially a multiple-choice question task, evaluated by accuracy based on the correctness of the selected answer across different languages and cultural contexts. Our best approach obtains test-set accuracy scores of 51.4689 for Subtask 1 (SAQ) and 80.26 for Subtask 2 (MCQ). For the final ranking, the organizers use the aggregate of the accuracy scores. Even so, our approach has yielded good results.
VAP-GameController at SemEval-2026 Task 2: Lexical-based and Emotion-Aware Approaches for Longtitudinal Emotion Prediction
Huy Le | Truong Phu | Trung Tran | Nga Nguyen | Monojit Choudhury
In this work, we participate in SemEval-2026 Task 2, which focuses on predicting continuous valence and arousal trajectories from longitudinal ecological essays. To model fine-grained emotional dynamics, we explore three approaches: (1) hierarchical encoder-based models to capture contextual emotional patterns, (2) a lexicon-based pipeline with linguistic rules and a dual-level calibration mechanism for personalized estimation, and (3) a hybrid framework that integrates lexical emotional signals into neural encoders. Experiments on the official dataset, evaluated using Pearson correlation (r) and MAE, show consistent improvements over baseline methods, highlighting the complementary strengths of neural representations and calibrated lexical features.
TeleAI at SemEval-2026 Task 13: Data-Centric Full-Parameter Fine-Tuning with Multi-Level Ensembling for Generated Code Detection
Shiquan Wang | Fang Yu | Shuangyong Song | Yongxiang Li | Xuelong Li
This paper presents our top-ranking system for SemEval-2026 Task 13 on code generation detection under multi-lingual and distribution-shift settings. Our approach achieved 1st place in Subtasks A and B, and 2nd place in Subtask C in the official evaluation. Our framework integrates data-centric analysis, full-parameter model adaptation, and multi-level ensemble learning. We first analyze label and length distributions and apply repeated oversampling to address class imbalance. We then optimize prompts in a data-driven manner to improve inference stability. Based on Qwen3-30B-A3B-Instruct, we conduct full-parameter fine-tuning with diverse training configurations and integrate multiple checkpoints using soft voting, hard voting, logits-based voting, and LightGBM stacking. Experimental results demonstrate substantial improvements over zero-shot baselines and consistent gains from ensemble strategies, validating the effectiveness of systematic adaptation and ensembling for robust code generation detection.
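The difference between two of the ensembling strategies listed in this abstract, hard voting and soft voting, can be shown with a small sketch. Each checkpoint is represented by a list of class probabilities; the function names are hypothetical, and logits-based voting and LightGBM stacking are not covered.

```python
from collections import Counter


def hard_vote(prob_lists):
    """Each model votes for its argmax class; the most common vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_lists]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(prob_lists):
    """Average the class probabilities across models, then take the argmax."""
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / len(prob_lists)
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)
```

The two rules can disagree: with probabilities `[0.6, 0.4]`, `[0.6, 0.4]`, and `[0.1, 0.9]`, hard voting picks class 0 while soft voting picks class 1, because one confident model outweighs two marginal ones.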
CodeHunters at SemEval-2026 Task 13: Detecting Machine-Generated Code with Multiple Programming Languages, Generators, and Application Scenarios
Daniel-Antoniu Dumitru | Simina Lazăr | Nicoleta Danilă (amargheoalei) | Daniela Gîfu | Diana Trăndăbăț
We participated in Subtasks A and B, where we fine-tuned three different pre-trained models (UniXCoder, CodeT5, and CodeBERT). The paper describes the detailed approach for both subtasks.
YNU-HPCC at SemEval-2026 Task 11: Mitigating Content Effects in Syllogistic Reasoning with Qwen2-1.5B-Instruct and XLM-RoBERTa-Large for English and Multilingual Tasks
Rongchuan Luo | Jin Wang | Xuejie Zhang
This paper addresses SemEval-2026 Task 11, which focused on mitigating content effects in syllogistic reasoning. Logical validity is often conflated with semantic plausibility in large language models. Prior methods rely on standard fine-tuning or prompting, without explicit bias control. A rule- and template-based symbolic data augmentation framework is proposed for fine-tuning the Qwen2-1.5B-Instruct model and instruction-tuning the XLM-RoBERTa-large model. Logic-preserving synthetic data are generated through lexical rules. The system is ranked 1st in Task 1 with a perfect overall score of 100, and 6th in Task 3 with a score of 56.97. Code is publicly available at: https://github.com/YNU-HPCC/semeval-2026-task11.
PuerAI at SemEval-2026 Task 5: Homograph Appropriateness Assessment via DeBERTa Contrastive Regression and Contextual Grouping
Jiaxu Dao | Zhuoying Li | Hangchao Ma | Jinli Tong | Xiaoli Lan | Yifan Lu | Zhanji Yang
To assess homograph appropriateness in narrative contexts for SemEval-2026 Task 5, we propose a contrastive regression framework. This approach combines candidate sense definitions with full narrative texts to establish an MSE regression baseline, further enhanced by a contextual grouping ranking loss that models relative rationality among senses. Evaluated on the official AmbiStory dataset, our method consistently outperforms the baseline in accuracy and Spearman correlation. These results validate the efficacy of relative order modeling for capturing fine-grained semantic nuances in complex narratives. The code is available at: https://github.com/daojiaxu/Semeval2026task5.
We present our system for the DimASR subtask of SemEval-2026 Task 3: DimABSA, targeting dimensional sentiment regression of Valence-Arousal scores in English restaurant reviews. Our approach leverages Qwen3 large language models combined with contrastive LLM-based data augmentation to enrich training data and capture subtle affective variations. Experiments show that this data augmentation framework significantly improves performance on the DimASR task, particularly in capturing subtle affective shifts at the aspect level. Finally, our system achieves a score of 1.227 RMSE on the test set.
YNU-HPCC at SemEval-2026 Task 2: Contrastive Calibration and Temporal Modeling for Continuous Valence-Arousal Prediction
Xin Lan | Jin Wang | Xuejie Zhang
This paper addresses continuous affect modeling in SemEval-2026 Task 2 through two task-specific architectures tailored to static state estimation and dynamic change prediction. To mitigate semantic ambiguity and annotation subjectivity in Subtask 1, a hard-prompt-based regression model is developed and enhanced with unsupervised contrastive learning (SimCSE) and supervised contrastive calibration (SCL) grounded in an external affect lexicon. This design improves the structural consistency and scale stability of textual representations in the Valence–Arousal (V/A) space. For Subtask 2a, which involves irregular time intervals and historical dependencies, a Time-Aware LSTM architecture is introduced to integrate current affective states with temporally enriched historical trajectories. Experimental results show that the YNU-HPCC system ranks 2nd in both subtasks. In Subtask 1, the Valence and Arousal scores are 0.677 and 0.528, respectively; in Subtask 2a, they are 0.692 and 0.647.
PICT at SemEval-2026 Task 3: A Transformer-Based System for Dimensional Aspect-Aware Sentiment Regression with Weighted Layer Pooling
Aditya Bhalgat | Omkar Jagtap | Anupama Phakatkar
Team PICT’s submission for SemEval-2026 Task 3 (DimASR) tackles continuous valence and arousal prediction by heavily focusing on variance reduction and avoiding cross-domain negative transfer. We built strictly domain-isolated pipelines for the Laptop and Restaurant datasets using a RoBERTa-Large backbone. Our architecture extracts a rich feature hierarchy using weighted layer pooling, isolates local context with a [CLS]-driven aspect-aware attention module, and maps to the continuous space using a deep residual regression head. Regularized via R-Drop and SWA, our system achieved 3rd place in the Restaurant domain (RMSE: 1.195) and 9th in the Laptop domain (RMSE: 1.326).
YNU-HPCC at SemEval-2026 Task 6: Hierarchical Taxonomy Prompting and CoT Distillation for Political Clarity Classification
Canning Wen | Jin Wang | Xuejie Zhang
In political interviews, politicians frequently employ evasion strategies to avoid direct answers, making it challenging to evaluate response clarity in Natural Language Processing. This paper presents the YNU-HPCC system for SemEval-2026 Task 6: clarity classification in political interviews. To address the limitation that traditional models capture only surface-level semantics, this paper proposes two reasoning-enhanced frameworks. First, we introduce Hierarchical Taxonomy Prompting. This method guides LLMs to follow a strict top-down classification logic. Specifically, the model determines the clarity level before identifying specific evasion techniques. Furthermore, it explicitly articulates the reasoning process. Second, to balance reasoning capability with resource constraints, we employ Chain-of-Thought Distillation. We use DeepSeek V3.1 as a teacher model to generate comprehensive reasoning chains, which are then used for supervised fine-tuning (SFT) of the smaller student models. Experimental results demonstrate the effectiveness of our approach: the system achieved 6th place in Task 1 and 5th place in Task 2 among all participating teams, highlighting the importance of reasoning processes in detecting complex linguistic evasion.
mdok-style at SemEval-2026 Task 9: Finetuning LLMs for Multilingual Polarization Detection
Dominik Macko | Alok Debnath | Jakub Simko
SemEval-2026 Task 9 is focused on multilingual polarization detection. Specifically, it covers the identification of multilingual, multicultural and multievent polarization along three axes (in subtasks), namely detection, type, and manifestation. Online polarization is a concern because it is often followed by hate speech, offensive discourse, and social fragmentation. Detecting it before it escalates is therefore crucial for a safer and more inclusive online space. We address this task by finetuning mid-size LLMs for sequence classification using the QLoRA parameter-efficient finetuning technique. We augment the multilingual (22 languages) training sets with anonymized, lower-cased, upper-cased, and homoglyph-substituted counterparts, making the detection more robust.
mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code
Adam Skurla | Dominik Macko | Jakub Simko
Multi-domain detection of machine-generated code snippets in various programming languages is a challenging task. SemEval-2026 Task 13 approaches this challenge from several angles, both as a binary detection problem and as attribution of the source. Specifically, its subtasks also cover detection of the generator LLM family, as well as hybrid code co-generated by humans and machines and adversarially modified code that hides its origin. Our submitted systems adapt the existing mdok approach (focused on machine-generated text detection) to these specific kinds of problems by exploring various base models that are more suitable for code understanding. The results indicate that the submitted systems are competitive in all three subtasks. However, the margins from the top-performing systems are significant, and thus further improvements are possible.
DUTIR at SemEval-2026 Task 8: A Hybrid Retrieval and Faithfulness-Guarded Framework for Multi-Turn RAG
Jin Yang | Yichong Chen | Liang Yang
This paper describes the system submitted by DUTIRtaskC for SemEval-2026 Task 8: MTRAGEval (Task C). Multi-turn Retrieval-Augmented Generation (RAG) poses significant challenges in context tracking, retrieval precision, and hallucination mitigation. Our proposed system addresses these by employing a multi-stage pipeline consisting of: (1) LLM-based query rewriting (powered by GPT-5.2) to resolve conversational dependencies; (2) a hybrid retrieval module combining dense embeddings (BGE-M3) and sparse retrieval (BM25) with Reciprocal Rank Fusion (RRF); (3) a confidence-based answerability gating mechanism; and (4) a post-generation faithfulness guard. Experimental results on the blind test set show that our approach achieves a Composite Score of 0.5576, ranking 4th out of 29 participating teams. Detailed analysis reveals that our system significantly outperforms strong baselines in faithfulness and successfully handles underspecified queries.
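Reciprocal Rank Fusion, used in this abstract to merge dense and sparse retrieval results, scores each document by summing 1 / (k + rank) over the rankings in which it appears. The sketch below uses the commonly cited default k = 60, which is an assumption here; the abstract does not give the constant, and the function name and tie-breaking by document ID are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; ranks start at 1.
    Ties are broken by document ID for determinism."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical dense and sparse result lists.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "d"]])
```

RRF uses only rank positions, so it needs no calibration between the dense similarity scores and the BM25 scores, which live on different scales.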
NLP-FSDM at SemEval-2026 Task 2: Temporal Smoothing and CCC-MAE Optimization for Balanced Longitudinal Affect Assessment
Abdessamad Benlahbib | Zouhir Essalmani | Achraf Boumhidi | Anass Fahfouh | Hamza Alami
This paper describes the NLP-FSDM system for SemEval-2026 Task 2, Subtask 1 on longitudinal affect assessment. The task requires predicting Valence and Arousal (V & A) scores for sequences of ecological essays and feeling words written over time. We adopt ModernBERT-large as a text encoder and formulate the task as a joint regression problem optimized using a Concordance Correlation Coefficient (CCC) loss combined with a lightly weighted Mean Absolute Error (MAE) term. To reduce variance induced by fine-tuning large transformers on relatively small user-specific datasets, we employ a three-seed ensemble. Finally, we introduce a lightweight post-inference temporal smoothing mechanism applied per user to improve within-user consistency. Our system achieves a composite r of 0.546 for Valence and 0.453 for Arousal, demonstrating stable cross-dimensional performance without explicitly modeling sequential dependencies.
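A CCC loss combined with a lightly weighted MAE term, as described in this abstract, could be sketched as follows. The MAE weight of 0.1 and the function names are assumptions for illustration; the abstract does not state the actual weight.

```python
def ccc(pred, gold):
    """Concordance Correlation Coefficient:
    2 * cov / (var_pred + var_gold + (mean_pred - mean_gold) ** 2).
    Undefined when both inputs are constant and equal (zero denominator)."""
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gold) / n
    vp = sum((p - mp) ** 2 for p in pred) / n
    vg = sum((g - mg) ** 2 for g in gold) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold)) / n
    return 2 * cov / (vp + vg + (mp - mg) ** 2)


def ccc_mae_loss(pred, gold, mae_weight=0.1):
    """(1 - CCC) plus a lightly weighted MAE term; the weight is an assumption."""
    mae = sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)
    return (1.0 - ccc(pred, gold)) + mae_weight * mae
```

Unlike Pearson correlation, CCC penalises shifts in mean and scale, so predictions that track the gold trend but sit at the wrong level still incur a loss.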
Team Macaroni at SemEval-2026 Task 10: PsyCoMark: Psycholinguistic Conspiracy Marker Extraction and Detection
Rofaida Rabehi | Nicolai Plenk | Miriam Han
This paper describes our submission to SemEval-2026 Task 10: PsyCoMark, which addresses span-level identification of psycholinguistic conspiracy markers and document-level conspiracy classification. For Subtask 1, we fine-tune several pretrained transformer encoders and analyse their behaviour under different training configurations. For Subtask 2, we develop a hybrid system that combines ModernBERT-large with surface-level linguistic features. Our results show that straightforward fine-tuning of strong pretrained models is more effective than more complex pipelines and that additional handcrafted features do not yield consistent improvements. On the official test set, we rank 18th in Subtask 1 (overlap-based macro F1 = 0.16) and 20th in Subtask 2 (macro F1 = 0.76).
Tralaleros at SemEval-2026 Task 9: Multilingual Polarization Detection with Transformer-based Models
Adrian Dahl | Bado Völckers | Adam Mierzwa
We present a multilingual polarization detection system for SemEval-2026 Task 9 (Subtask 1), covering 22 languages with transformer-based models. We evaluate four strategies: data rebalancing, hyperparameter optimization, model scaling, and ensembling, and show that undersampling harms performance, while larger pretrained models improve results substantially. Our best single model, XLM-RoBERTa Large, achieves a Macro-F1 of 0.7929, with analysis showing complementary strengths across model families (e.g., RemBERT for several Indic languages and mDeBERTa for Semitic/morphologically rich languages). Ensemble gains are marginal, suggesting language-aware routing is more promising than uniform aggregation. We also provide a privacy-preserving Firefox extension that runs local ONNX inference for practical deployment without sending user text to external servers.
Dream at SemEval-2026 Task 13: SALSA for Single-Pass Machine-Generated Code Detection
Ruslan Berdichevsky | Shai Nahum-Gefen | Elad Ben-Zaken
Large language models have transformed code generation, raising concerns around authorship, assessment integrity, and software trust. SemEval-2026 Task 13 Subtask A operationalizes detection as binary classification over code snippets, with a particular emphasis on out-of-distribution (OOD) generalization across unseen programming languages and application domains. We propose a SALSA-style formulation, Single-pass Autoregressive LLM Structured Classification, that maps each class to a dedicated output token and trains the model to emit a single-token label in a structured response. Rather than engineering hand-crafted features or decision rules, this formulation delegates the authorship decision to the model. To improve OOD robustness, we combine balanced sampling across languages with parameter-efficient fine-tuning and conservative training (low learning rate, single epoch) to avoid overfitting to the training domain. Our best system achieves OOD F1 = 0.789 on the official leaderboard, substantially outperforming the CodeBERT baseline (F1 = 0.305).
PUEB-DimASR at SemEval-2026 Task 3: Escaping the Mean Regression Trap with Graph-Enhanced Transformers for Dimensional Aspect-Based Sentiment Regression
Oskar Riewe-Perła | Agata Filipowska
The DimABSA shared task aims to combine dimensional analysis with Aspect-Based Sentiment Analysis (ABSA). It addresses the lack of continuous sentiment representation, as opposed to categorical labels (e.g., positive, negative, or neutral), and enriches it with an assessment of arousal. Our team’s PUEB-DimASR investigates the "mean-regression trap", the tendency of standard MSE loss in high-dimensional sentiment tasks to over-predict values closer to the global mean. We propose a two-step advancement in model architecture. First, we enhance baseline Transformers with Graph Convolutional Networks (GCN) to capture syntactic aspect-sentiment dependencies. Second, we evaluate and recommend a Hybrid loss function that combines Mean Squared Error (MSE) and Concordance Correlation Coefficient (CCC). Our proposed GCN-deBERTa model consistently outperforms the baseline across six target languages. While MSE loss yields the best RMSE scores for English (0.876) and Chinese (0.546), it introduces significant variance collapse, which we successfully mitigated using the Hybrid loss, achieving near-perfect distributional alignment (99.6%). Additionally, our model trained with the Hybrid loss achieved the best RMSE scores for Russian (1.136), Tatar (1.207), and Ukrainian (1.178).
YNJTC at SemEval-2026 Task 11: A Neuro-Symbolic Hybrid Pipeline for Content-Independent Syllogistic Reasoning
Junhao Fu | Yun He | Lina Zhao | Weijuan Li
This paper presents a neuro-symbolic hybrid pipeline for SemEval-2026 Task 11 that addresses the content effect in syllogistic reasoning. The system converts natural-language syllogisms into formal mood-figure representations via regex parsing and LLM-powered extraction, then determines validity through symbolic table lookup against the 24 classically valid forms. The approach achieved a perfect Combined Score of 100.0 on Subtask 1 and competitive results on all four subtasks.
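The symbolic table lookup against the 24 classically valid forms, as described in this abstract, can be sketched as below. The table lists the traditional valid moods per figure, including the forms that rely on existential import. The data layout and function name are assumptions, and the regex and LLM extraction steps that produce the mood and figure are not shown.

```python
# Traditional valid moods per figure (6 per figure, 24 in total).
# Moods are written as major premise, minor premise, conclusion.
VALID_FORMS = {
    1: {"AAA", "EAE", "AII", "EIO", "AAI", "EAO"},
    2: {"EAE", "AEE", "EIO", "AOO", "EAO", "AEO"},
    3: {"AAI", "IAI", "AII", "EAO", "OAO", "EIO"},
    4: {"AAI", "AEE", "IAI", "EAO", "EIO", "AEO"},
}


def is_valid(mood, figure):
    """Return True if the mood-figure pair is one of the 24 valid forms."""
    return mood.upper() in VALID_FORMS.get(figure, set())
```

Because the lookup depends only on the extracted form and never on the wording of the premises, the validity decision cannot be swayed by how plausible the conclusion sounds.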
Draken at SemEval-2026 Task 2: Frozen BERT Embeddings with Ridge Regression for Predicting Emotional Valence and Arousal
Rajalakshmi Sivanaiah | Angel Deborah S | Krishna Varun R | Krishnaraj N
We present a lightweight and computationally efficient system for Subtask 1 of SemEval-2026 Task 2, which focuses on predicting longitudinal variation in emotional valence and arousal from ecological essays. Our approach uses frozen contextual embeddings from BERT-base-uncased to obtain mean-pooled sentence representations without fine-tuning the transformer. These 768-dimensional embeddings are fed into a multi-output Ridge regression model to jointly predict normalized valence and arousal scores. The system emphasizes simplicity, reproducibility, and efficiency, avoiding complex temporal architectures, external lexicons, or user metadata. Despite its simplicity, the model achieves strong performance for valence prediction (r = 0.594) and moderate performance for arousal prediction (r = 0.296). Detailed evaluation across seen and unseen users, as well as between-user and within-user splits, shows that between-user correlations are consistently higher, and that valence is substantially easier to predict than arousal. These findings suggest that frozen transformer embeddings combined with linear regression provide a competitive and interpretable baseline for longitudinal affect prediction tasks.
NLPGroup8 at SemEval-2026 Task 2: Diverse Ensembles and Hierarchical Transformers for Emotional State Prediction
Troy Arthur | Aidan Kelley | Sierra Reschke
Our approach combines a diverse ensemble for Subtask 1 with a context-aware transformer aggregation architecture for temporal forecasting in Subtasks 2A and 2B. The ensemble achieved state-of-the-art performance for the Subtask 1 Valence metric, ranking first in Valence prediction. Our Subtask 2B independent architecture ranked second in Valence prediction and fourth in Arousal prediction among competitive submissions. We also report results for Subtask 2A, analyzing challenges our architecture faced with next-entry affect forecasting. These findings underscore the significance of our methodology for affective prediction, achieved without reliance on external affective datasets.
The Counterfactuals at SemEval-2026 Task 9: Can Counterfactually-Inspired Preprocessing help Detect Polarization?
Teagan Johnson
This paper presents the English-language submissions of The Counterfactuals team for the three subtasks of Task 9 at SemEval 2026. The task aims to detect multicultural online polarization, how it is expressed, and in what contexts. The task provides a high-quality annotation dataset of posts that follows a three-level schema: polarized or not (subtask 1), polarization type classification (subtask 2), and manifestation identification (subtask 3). I construct a pointwise mutual information-based lexicon that identifies highly-correlated words with the polarized class as labeled in subtask 1. Using this lexicon, I implement a large language model data augmentation technique. I then use the preprocessed datasets to finetune a BERT model (BERTweet) for each subtask. My highest performing models placed 48th out of 60, 35th out of 36, and 17th out of 24 on subtasks 1, 2, and 3 respectively. All code is available on GitHub.
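A pointwise mutual information lexicon of the kind mentioned in the abstract above could be built roughly as follows. The whitespace tokenization, document-level counting, the `min_count` threshold, and the toy posts are assumptions for illustration only, not details taken from the paper.

```python
import math
from collections import Counter


def pmi_lexicon(docs, labels, target=1, min_count=2):
    """Score each word by PMI with the target class, using document counts."""
    n_docs = len(docs)
    n_target = sum(1 for y in labels if y == target)
    doc_freq = Counter()  # documents containing the word
    joint = Counter()     # target-class documents containing the word
    for text, y in zip(docs, labels):
        for word in set(text.lower().split()):
            doc_freq[word] += 1
            if y == target:
                joint[word] += 1
    scores = {}
    for word, df in doc_freq.items():
        if df < min_count or joint[word] == 0:
            continue  # too rare, or never seen with the target class
        p_joint = joint[word] / n_docs
        p_word = df / n_docs
        p_class = n_target / n_docs
        # PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) )
        scores[word] = math.log2(p_joint / (p_word * p_class))
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical toy data: label 1 = polarized, 0 = not polarized.
docs = ["they always lie", "they lie again",
        "nice weather today", "weather is nice"]
labels = [1, 1, 0, 0]
lexicon = pmi_lexicon(docs, labels)
print(lexicon)
```

Words that occur only in polarized posts receive the maximum score for their frequency, while words never seen with the target class are dropped rather than assigned negative infinity.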
CCNU at SemEval-2026 Task 10: Conspiracy Marker Extraction and Detection via Multi-task Learning and LLM-based Data Augmentation
Zijun Wang | Guanyi Chen
This paper presents the system of CCNU for SemEval-2026 Task 10: Psycholinguistic Conspiracy Marker Extraction and Detection. The task requires identifying fine-grained conspiracy markers that characterize conspiracy thinking, as well as determining whether a Reddit comment constitutes conspiratorial discourse. For Conspiracy Marker Extraction (Subtask 1), we adopt a Unified Multi-Task Sequence Labeling Framework that jointly models multiple conspiracy markers within a single labeling space. This formulation enables collaborative learning across marker types while maintaining a compact architecture. For Conspiracy Detection (Subtask 2), we formulate the problem as sentence-level classification. Across both subtasks, we apply data augmentation powered by large language models and ensemble inference to improve robustness and generalization. Our system achieves strong performance on Subtask 1, ranking 3rd on the official test set, and delivers competitive results on Subtask 2.
HCMUS RepeatedGames at SemEval-2026 Task 12: CausalRAG: Synergizing Causal Graph Retrieval and Extended LoRA for Abductive Reasoning
Duy Minh Dao Sy | Nguyen Tran | Trung Kiet Huynh | Phu Quy Nguyen Lam | Phu Hoa Pham
This paper presents our system developed for SemEval-2026 Task 12: Abductive Event Reasoning (AER). The shared task aims at identifying the most plausible cause of a real-world event from multiple-choice options, given retrieved documents as evidence. In this work, we propose using hybrid retrieval that combines BM25 keyword matching with dense semantic search to capture explicit causal keywords. Moreover, we apply extended LoRA fine-tuning that trains both attention and MLP layers of a 32-billion parameter language model with only 0.81% trainable parameters. For final refinement, we perform development set fine-tuning to leverage validation data before inference. Our system achieves a score of 0.90 on the official test set, tying for fifth place among participating teams and representing a +0.27 improvement over our baseline.
UIT-AMMC at SemEval-2026 Task 13: Exploiting Structural Formatting Signatures for Robust AI-Generated Code Detection
Cuong Pham | Minh Nguyen | Minh Le | An Nguyen | Chinh Nguyen
We participated in Subtask A with our Structure-Aware Contrastive Cascade, a multi-stage architecture designed to distinguish between human-authored and machine-generated code by integrating generative reasoning with explicit structural linguistic features. Our system focuses on exploiting structural formatting signatures that frequently emerge in AI-generated code as a byproduct of post-training alignment and readability optimization. The pipeline utilizes a Qwen-2.5-Coder 14B model fine-tuned via QLoRA, incorporating stochastic data augmentation techniques to ensure robustness across unseen programming languages. Final classification is achieved through a late-fusion mechanism that combines contrastive probability scores with statistical metrics of code presentation density. For samples exhibiting high epistemic uncertainty, we implement a multi-agent adversarial debate step to refine the final verdict. This approach enabled our system to achieve a Macro F1 score of 0.802, ranking 3rd on the official leaderboard.
NUST CodeIntel at SemEval-2026 Task 13: Cross-Domain Detection of Machine-Generated Code via Stylometric Features and Transformer Models
Azher Ali | Mehwish Fatima
We present our submission to SemEval-2026 Task 13 on cross-language and cross-domain detection of machine-generated code. We compare TF-IDF-based models with stylometric features against LoRA-tuned transformer encoders. While transformers achieve near-perfect in-distribution performance, they degrade sharply on unseen languages and domains. In contrast, a TF-IDF + Logistic Regression model attains the best test Macro-F1 and shows greater robustness. These results highlight the limitations of neural models under distribution shift and the strength of lexical baselines for cross-domain generalization.
CuriosAI at SemEval-2026 Task 4: A Comprehensive Study of Zero-Shot versus Fine-Tuned Approaches for Narrative Similarity
Yuki Shibata | Hiroki Takushima | Fumika Beppu | Aiswariya Manoj Kumar | Daichi Yamaga | Takayuki Hori
This paper presents our system for SemEval-2026 Task 4 on narrative similarity assessment. Through comprehensive experimentation, we evaluated various approaches including zero-shot pre-trained models, prompt engineering with large language models, and multiple fine-tuning strategies using synthetic data. Our experiments revealed a surprising finding: pre-trained sentence transformers in a zero-shot setting consistently outperformed all fine-tuning attempts. Specifically, our best system using sentence-transformers/sentence-t5-xl achieved 67.5% accuracy on the development set (95% CI: [61.0%, 74.0%]), while all fine-tuning approaches resulted in performance degradation of 2-18 percentage points. We provide a detailed analysis of why fine-tuning failed and discuss the implications for narrative similarity tasks.
YNU-HPCC at SemEval-2026 Task 4: Narrative Similarity via Multi-Perspective E5-Mistral and Embedding Routing
Feiyang Song | Jin Wang | Xuejie Zhang
This paper presents the system developed by the YNU-HPCC team for SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. The task challenges computational systems to identify narrative similarity across three orthogonal dimensions: abstract theme, course of action, and outcomes. The primary scientific difficulty lies in distinguishing the underlying structural fabula from surface-level lexical overlaps, particularly when facing long-context narratives with subtle plot twists. To address this, our approach employs a hybrid architecture that strategically decouples retrieval and ranking tasks. For Track A, we introduce a dynamic routing mechanism where an instruction-tuned E5-Mistral-7B model handles clear cases, while ambiguous hard samples are routed to a Gemini-3-Flash reasoner. For Track B, we leverage the global semantic modeling capabilities of Gemini-Embedding-001 via a structure-preserving chunking strategy, enhanced by All-But-The-Top (ABTT) during inference. Extensive experiments on the official test set show that this divide-and-conquer strategy effectively balances local instruction following with global open-domain generalization. Our system performs competitively, ranking 5th in Track A and 2nd in Track B among all participating teams.
SemEval-2026 Task 10 is focused on conspiracy detection. Specifically, the goal is to detect whether a Reddit comment expresses a conspiracy belief. Our submitted mdok-style system utilizes data augmentation and self-training (to cope with a rather small amount of training data) to finetune the Qwen3-32B model for a binary text-classification task. The submitted system is very competitive, ranking in the 85th percentile (8th out of 52 submissions). The results show that our approach, which originated in machine-generated text detection, can be used for conspiracy detection as well.
RvH-40 at SemEval-2026 Task 11: Disentangling Reasoning from Belief through Symbolic Abstraction
Niek Biesterbos | Mark Den Ouden | Janiek De Rijke
Large Language Models (LLMs) often struggle with syllogistic reasoning due to "belief bias," where semantic world knowledge overrides formal logical structure. In this paper, we present our submission for the SemEval-2026 Task 11 shared task. We investigate the discrepancy between a model’s latent logical capabilities and its performance on natural language text. By employing symbolic transformations, specifically variable and pseudoword substitution, we demonstrate that models like Qwen2.5-14B possess strong inherent reasoning skills that are suppressed by linguistic content. We propose a "logic alignment" strategy using Low-Rank Adaptation (LoRA) to bridge this gap. Our final model achieved a near-perfect accuracy of 97.92% on the validation set and 96.34% on the official hidden test set, effectively eliminating content bias while maintaining robust generalization across abstract formats.
Ajman University at SemEval-2026 Task 2: Overcoming Scale Collapse in Temporal Emotion Modeling via Residual Learning
Haseebullah Jumakhan | Soud Assad | Seyed Abdullah | Mahmoud Al-Ayyoub
The Ajman University team develops a set of specialized architectures for longitudinal affective forecasting in SemEval-2026 Task 2. A standard transformer model sets our performance floor in Subtask 1 (ranked 18). In Subtask 2A (ranked 7) and Subtask 2B (ranked 8), our main contribution is to address the problem of scale collapse with a novel "bifurcated leviathan" architecture that combines residual learning with target scaling. An additional contribution is that we counteract the effects of regression to the mean by optimizing covariance via specialized objective functions (CCC and Huber), while enforcing strict user-level data splits. Finally, we show empirically that standard gradient stabilization methods decrease zero-shot cross-subject generalization, even when they optimize intra-subject memorization.
Team TüLK at SemEval-2026 Task 1: Humor Generation with Qwen and Group Relative Policy Optimization
Konrad Brüggemann | Luting Hou
This paper addresses the challenge of computational humor generation proposed in SemEval-2026 Task 1: Humor Generation. Our approach leverages Group Relative Policy Optimization, with an LLM serving as the policy and a custom joke rating model providing a reward signal. We demonstrate that this framework is an effective and computationally efficient approach, reliably producing genuinely funny content that adheres to task constraints.
UMUTeam at SemEval-2026 Task 6: Soft-Voting Transformer Ensembles for Detecting and Classifying Response Ambiguity in Political Discourse
Tomás Bernal-Beltrán | Ronghao Pan | Jorge Gómez-Navalón | José Antonio García-Díaz | Rafael Valencia-Garcia
Political discourse frequently involves strategically ambiguous responses, particularly in high-stakes settings such as presidential debates and interviews. Detecting whether a politician has directly answered a question, provided an ambiguous reply or issued a clear non-reply remains a challenging task due to the pragmatic and rhetorical nature of political language. This paper describes our participation in the SemEval 2026 CLARITY shared task on response ambiguity detection and classification in English. We focused exclusively on Task 1 (Clarity-level Classification) and proposed a weighted soft-voting ensemble that combines four fine-tuned encoder-only transformer models: RoBERTa-large, BERT-large-cased, DistilBERT-cased and ModernBERT-large. Each model was optimized through grid search and their predicted class probability distributions were aggregated using a weighted linear combination. On the official test set, our system achieved a macro-F1 score of 0.71, ranking 26th out of 41 participating teams. Even with the performance gap compared to top-ranked systems, our results demonstrate that a lightweight set of moderately sized encoder models can provide stable and competitive performance without relying on external data or large-scale architectures.
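The weighted soft-voting step described in the abstract above amounts to a weighted average of per-model class probabilities followed by an argmax. The sketch below uses three hypothetical models with made-up probabilities and weights; the actual system combines four models, and its weights are not given in the abstract.

```python
def soft_vote(prob_dists, weights):
    """Weighted linear combination of class probability distributions.

    prob_dists: one probability list per model, all over the same classes.
    weights: one non-negative weight per model (normalised here).
    Returns (predicted class index, combined distribution).
    """
    total = sum(weights)
    n_classes = len(prob_dists[0])
    combined = [
        sum(w * dist[c] for w, dist in zip(weights, prob_dists)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=combined.__getitem__)
    return label, combined


# Hypothetical outputs for one response over three clarity classes.
model_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
]
weighted_label, weighted_dist = soft_vote(model_probs, [2.0, 1.0, 1.0])
uniform_label, uniform_dist = soft_vote(model_probs, [1.0, 1.0, 1.0])
print(weighted_label, uniform_label)
```

In this toy case the uniform ensemble picks class 1, while up-weighting the first model flips the decision to class 0, which is why the weights are worth tuning on held-out data.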
UMUTeam at SemEval-2026 Task 10: Transformer Ensembles for Conspiratorial Span Extraction and Detection
Jorge Gómez-Navalón | Ronghao Pan | Tomás Bernal-Beltrán | José Antonio García-Díaz | Rafael Valencia-Garcia
Conspiracy theories pose significant societal risks and require reliable automated detection methods. In this paper, we present our system for SemEval 2026 Task 10, addressing both conspiracy detection and psycholinguistic marker extraction. We leverage multiple pretrained transformer architectures and ensemble strategies to model conspiratorial discourse at both document and token levels. For classification, our ensemble achieves a weighted F1-score of 0.7688, indicating effective performance in distinguishing conspiratorial statements. For marker extraction, we formulate the task as a BIOES sequence labeling problem and enhance predictions through ensemble and specialist models. Our results highlight both the effectiveness of transformer-based approaches and the challenges of fine-grained conspiracy marker extraction.
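For readers unfamiliar with the BIOES scheme mentioned in the abstract above, the following sketch converts token-level spans into BIOES tags. The span format (start, exclusive end, label), the non-overlap assumption, and the example sentence are illustrative choices; the marker names follow the task's Actor, Action, Effect, and Victim categories.

```python
def spans_to_bioes(n_tokens, spans):
    """Convert non-overlapping token spans into BIOES tags.

    spans: iterable of (start, end_exclusive, label) token indices.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"        # single-token span
        else:
            tags[start] = f"B-{label}"        # beginning
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"        # inside
            tags[end - 1] = f"E-{label}"      # end
    return tags


tokens = ["The", "secret", "elite", "controls", "the", "media", "daily"]
spans = [(0, 3, "Actor"), (3, 4, "Action"), (4, 6, "Victim")]
tags = spans_to_bioes(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Compared with plain BIO, the explicit E- and S- tags give the model a direct signal for span boundaries, at the cost of a larger label set.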
CUETLuminaries at SemEval-2026 Task 11 Disentangling Logical Validity from Semantic Plausibility through Canonical Abstraction
Adnan Faisal | Shiti Chowdhury
Determining whether large language models (LLMs) perform genuine formal reasoning or rely on semantic heuristics is a key challenge in NLP. Syllogistic reasoning constitutes a theoretically principled evaluation paradigm where validity is fully determined by quantifier structure, allowing systematic analysis of structural inference disentangled from semantic plausibility. SemEval-2026 Task-11, Subtask-1: Disentangling Content and Formal Reasoning in Language Models, establishes a multilingual benchmark designed to rigorously isolate formal logical validity from semantic plausibility effects. The subtask evaluates English syllogistic reasoning under a binary classification setting using Overall Accuracy (ACC) and Total Content Effect (TCE), where lower TCE indicates stronger resistance to content-induced bias. Our proposed approach combines cross-validation, structured aggregation and bias-aware evaluation to optimize the robustness–performance trade-off. It achieves 93.19% accuracy with a TCE of 3.13, yielding a strong combined score of 38.56 under the official evaluation metric. Condition-wise and multi-run analysis confirms that robustness-focused optimization curbs content-driven errors, reinforcing the necessity of bias-aware training for formal inference.
CuriosAI at SemEval-2026 Task 10: Hybrid approaches to conspiracy span extraction and conspiracy detection
Hiroki Takushima | Fumika Beppu | Aiswariya Manoj Kumar | Yuki Shibata | Takayuki Hori | Daichi Yamaga
We present CuriosAI’s system for SemEval-2026 Task 10, addressing Conspiracy Marker Extraction and Conspiracy Detection. For marker extraction, we employ multi-label token classification with a bidirectional transformer (DeBERTa-v3-large) to predict overlapping spans. Alternative feature-based and LLM-based approaches do not surpass the encoder baseline. For Conspiracy Detection, we compare heterogeneous models, including transformer fine-tuning, lexical classifiers, embedding-based models, and LLM-based refinement. Development-optimal models do not always generalize best; logit-level ensembling achieves the strongest test performance (F1=0.7620). These results highlight the importance of bidirectional token modeling for span extraction and calibration-aware ensembling for robust detection.
AlphaLyrae at SemEval-2026 Task 9: Metric Learning and Asymmetric Loss for Chinese Polarization Analysis
Minh-Hoang Le | Khoan Phung
For the Chinese track of SemEval-2026 Task 9 (Detecting Online Polarization), we address two key challenges: polarized content frequently uses implicit language (e.g., homophones and coded terms) to evade moderation, and class distributions exhibit severe long-tail imbalance. We propose a metric learning approach that frames polarization detection as semantic similarity matching, which captures implicit language patterns better than linear decision boundaries. We fine-tune an ERNIE-3.0 encoder with SoftTriple loss and apply kNN retrieval for binary detection (Subtask 1). For multi-label categorization (Subtasks 2 and 3), we transfer learned representations from the detection model and fine-tune with Asymmetric Loss. A priority-based stratified cross-validation strategy ensures minority classes appear across all training folds despite extreme label skew. Evaluated on the official 1,927-sample test set using an end-to-end pipeline, our system achieved Macro-F1 scores of 0.9190 (Rank 6) on Polarization Detection, 0.8244 (Rank 5) on Type Classification, and 0.6670 (Rank 4) on Manifestation Identification.
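The retrieval step for binary detection can be illustrated with a small nearest-neighbour classifier over embeddings. Cosine similarity, majority voting, `k=3`, and the two-dimensional toy vectors are assumptions for this sketch; in the described system the vectors would come from the fine-tuned encoder, and the abstract does not specify the similarity measure or voting rule.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(query, index, k=3):
    """Majority label among the k most similar indexed embeddings.

    index: list of (embedding, label) pairs from the training set.
    """
    ranked = sorted(index, key=lambda item: cosine(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-d embeddings: label 1 = polarized, 0 = not polarized.
index = [
    ([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.8, 0.3], 1),
    ([0.0, 1.0], 0), ([0.1, 0.9], 0),
]
print(knn_predict([1.0, 0.2], index), knn_predict([0.1, 1.0], index))
```

Because the decision depends on neighbourhood membership rather than a single hyperplane, examples that cluster together in embedding space can be labelled alike even when no linear boundary separates them.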
dutirshlee at SemEval-2026 Task 11: Symbolic Augmentation for Content-Bias-Resistant Syllogistic Reasoning
Songhuan Li | Liang Yang | Shengdi Yin | Qiang Zhang | Hongfei Lin
We describe our system for SemEval-2026 Task 11 Subtask 1 (English syllogistic validity). Our approach fine-tunes Qwen2.5-7B-Instruct with LoRA and a symbolic data augmentation (SDA) scheme that replaces real-world entities with abstract placeholders, explicitly decoupling logical form from content. The resulting model achieves 96.34% accuracy and a total content effect (TCE) of 2.15, yielding a primary score of 44.86. We provide detailed ablations and negative results (prompting, self consistency, contrastive decoding, structured chain-of-thought, and DPO) to characterize why direct LoRA training with SDA is the most robust configuration for this task. Finally, we use a specialist–generalist complementarity setting where a strong API model (ACC 99.48, TCE 1.06, score 57.68) is corrected by the SDA specialist on a single disagreement, producing a merged output with ACC 100 and TCE 0.
SU NLP 29 at SemEval-2026 Task 5: DynaOrd - Hybrid Dynamic Ordinal Regression with LoRA-Fine-Tuned DeBERTa-v3
Musab Khan
We describe our system submitted to SemEval-2026 Task 5 on rating the plausibility of word senses in ambiguous sentences within narrative contexts. The task requires predicting human-perceived plausibility scores on a 1-5 scale for candidate word meanings embedded in short stories, posing challenges such as limited training data and the ordinal nature of target labels. Our approach combines a DeBERTa-v3-large encoder with Low-Rank Adaptation (LoRA) and a dynamically weighted hybrid CORAL-MSE loss for ordinal regression. This formulation adapts the contribution of ranking and regression objectives during training, prioritizing ordinal consistency early and regression refinement in later epochs. We analyze the contributions of dynamic loss weighting to overall system performance.
Khaleesiyali at SemEval-2026 Task 2: Lexicon-Augmented RoBERTa for Valence–Arousal Regression on Ecological Essays
Eleale Tee
This paper presents a lexicon-augmented RoBERTa system for the SemEval-2026 Task 2 valence–arousal regression challenge. The model integrates deep contextual embeddings with a 6-dimensional feature vector derived from the NRC VAD lexicon, achieving a high token coverage rate of 72.05%. Under official user-aware evaluation, the system reached a competitive average composite correlation of 0.547, significantly outperforming the ridge regression baseline. The system demonstrated particular robustness in valence (r = 0.656) and achieved strong generalization to unseen users (r_arousal = 0.519). These findings indicate that lightweight lexicon-based statistics provide valuable complementary cues for longitudinal emotion modeling in modern transformer architectures.
UKPPsycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text
Darya Hryhoryeva | Amaia Zurinaga | Hamidreza Jamalabadi | Iryna Gurevych
This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings. Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics.
EcoAffectTrack at SemEval-2026 Task 2: A Hierarchical DeBERTa-Transformer Framework with CCC Optimization for Longitudinal Affect Modeling
Diya Satish Kumar | Om Joshi
This submission proposes a hierarchical framework for longitudinal affect modeling, specifically designed for predicting variations in emotional valence and arousal over time. The system utilizes a DeBERTa-v3 encoder backbone optimized with a differentiable Concordance Correlation Coefficient (CCC) Loss for affect assessment (Subtask 1). This approach prioritizes capturing the "shape" and trend of emotional trajectories over absolute point-wise accuracy, yielding a significant performance gain over standard Mean Squared Error. For state change forecasting (Subtask 2A), the framework employs a Transformer-based temporal forecaster with positional encoding to account for inter-subject variability in emotional baselines. Disposition profiling (Subtask 2B) is addressed using a deep attention network that aggregates historical embeddings to identify emotionally informative essays. Experimental results from the official competition indicate that aligning the loss function with evaluation metrics and utilizing task-specific temporal modeling are essential for robust performance in longitudinal emotion recognition.
Momentum at SemEval-2026 Task 2: LongVA-RoBERTa, a transformer-Based Longitudinal Valence and Arousal Modeling
Supriya Nadiger | Sunil Saumya | Rahul Pujari | Veeresh Hiremath | Kiran Chikaraddi | Anoop Kadkol
This paper studies emotion through the affective circumplex model, which represents valence and arousal in a continuous two-dimensional space. It also explores the disposition of emotion over time to identify behavioural cues and self-identified affective states. While traditional methods use categorical emotion classes, SemEval 2026 Task 2 studies emotions in continuous space. In this paper, we propose LongVA-RoBERTa, a transformer-based model for emotion regression on ecological essays. For Subtask 1, we develop an affect prediction framework employing RoBERTa with attention pooling and a regression head for valence and arousal prediction. In Subtask 2A, we employ a BiLSTM to capture temporal dependencies and fuse surface, contextual, and user-level features to predict short-term affect variation. Our results outperform the baseline, paving the way for further work on emotion prediction in continuous dimensional space.
LexMachina at SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time from Ecological Essays
Somdev Ganguli | Vibhan Dutta | Romit Datta | Amit Barman | Sudip Naskar
Tracking emotional dynamics like valence and arousal is critical for understanding users’ affective baselines in ecological text. However, encoder models often struggle to distinguish stable user traits from dynamic shifts, leading to poor generalization. This paper presents LexMachina, our system for SemEval-2026 Task 2, addressing "domain shift" and "regression to the mean." LexMachina utilizes a DeBERTa-v3-Base backbone with a bifurcated strategy: post-hoc Isotonic Regression for valence calibration and a Domain Adversarial Neural Network (DANN) to mitigate user-bias in arousal. LexMachina achieved composite scores of r=0.645 (Valence) and r=0.434 (Arousal), demonstrating that adversarial disentanglement effectively captures nuances in longitudinal affective data.
lakshadvani at SemEval-2026 Task 11: A Neuro-Symbolic Approach to Content-Independent Syllogistic Reasoning
Laksh Advani
We describe our system for SemEval-2026 Task 11 on disentangling content from formal reasoning. The content effect in syllogistic reasoning, where models judge validity based on conclusion plausibility rather than logical structure, persists even with explicit instructions to ignore real-world knowledge. We find that this bias is better addressed structurally than through prompting: by restricting the LLM to a translation role (mapping natural language to abstract variables) and delegating all deductive reasoning to a deterministic checker over the 24 valid Aristotelian forms, we eliminate content bias entirely on Subtask 1 (100.0 combined, TCE=0.0, 4th place). Our Subtask 2 system, which lacks this separation, scores 41.08 (7th place) despite 95.26% accuracy and 99.47% premise retrieval F1, because a TCE of 2.94 incurs a 58% penalty. A three-way ablation on training data using GPT-5 confirms the pattern: Vanilla LLM, 78% accuracy / TCE=19; LLM + Aristotelian Rules in Prompt, 90% accuracy / TCE=5; LLM + Symbolic Checker, 97% accuracy / TCE=3.
CUETLuminaries0227 at SemEval-2026 Task 13: Invariance-Oriented Representation Learning for Robust AI-Generated Code Detection
Shiti Chowdhury | Adnan Faisal
Large language models increasingly generate high-quality source code, making reliable detection of machine-generated code essential for maintaining authorship integrity and software accountability. However, detection systems often degrade under distribution shift, particularly across programming languages and application domains. SemEval-2026 Task 13 Subtask A addresses this challenge through a structured OOD evaluation framework that assesses binary machine-generated code detection across unseen languages and application domains. To mitigate this limitation, we propose a robustness-oriented framework that enhances feature-fused UniXcoder representations with supervised contrastive learning, adversarial language-invariant training and uncertainty-aware filtering to promote stable and shift-resilient representations. Our proposed system achieves a macro-F1 of 0.5411 on the official test set and maintains stable performance under severe language–domain shift. Our results demonstrate that domain-level semantic variation is the primary source of degradation under distribution shift, reinforcing the importance of invariance-oriented representations for stable OOD performance.
DualAxis AI at SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis
Yahya Missaoui | Solomon Kebede | Mounika Marreddy | Alexander Mehler
Dimensional Aspect-Based Sentiment Analysis models sentiment using continuous valence and arousal scores instead of discrete polarity labels, enabling fine-grained affect representation at the aspect level. SemEval 2026 Task 3 defines this setting through three subtasks covering aspect-level regression and structured extraction of aspect–opinion pairs with continuous scoring. We implement transformer-based baselines for all subtasks within a unified, reproducible framework. For aspect-level regression, we fine-tune pretrained encoders in an aspect-conditioned setup to predict valence and arousal. RoBERTa-large achieves the best development performance, with average RMSEs of 0.884 (restaurant) and 0.789 (laptop).
DUTH at SemEval-2026 Task 1: Prompt-Based Zero-Shot Large Language Models for Constrained Multilingual Humor Generation
Georgios Arampatzis | Avi Arampatzis
Humor generation is a challenging problem for natural language processing systems due to its subjectivity, cultural dependence, and reliance on creative language use. These challenges are further amplified in constrained multilingual settings, where models must satisfy explicit lexical or topical requirements while producing short and humorous outputs. In this paper, we present DUTH’s system for SemEval-2026 Task A on constrained multilingual joke generation in English, Spanish, and Chinese. Our approach leverages instruction-tuned large language models in a zero-shot setting, combining prompt engineering, controlled decoding, and lightweight post-generation validation to enforce constraint satisfaction and language consistency. We evaluate multiple model families and parameter scales, including Qwen and Mistral models. Human evaluation demonstrates that larger models consistently outperform smaller ones, with Qwen2.5-14B-Instruct achieving the strongest overall performance. Error analysis highlights remaining challenges such as lexical constraint violations and cross-lingual interference.
Ambirig at SemEval-2026 Task 5: Distributional Ordinal Modelling for Ambiguous Word Senses in Narrative Contexts
Soumyajit Roy
Word Sense Disambiguation (WSD) has traditionally been framed as selecting a single correct sense given context. However, natural language understanding by humans often involves ambiguity, underspecification, and graded plausibility judgments rather than categorical decisions. SemEval-2026 Task 5 explicitly targets this gap by requiring systems to predict human-perceived plausibility scores for word senses in short narratives. In this paper, we present a systematic empirical study of modelling plausibility as an ordinal distribution prediction problem. We hypothesise that standard classification objectives fail to capture the ordinal nature of human uncertainty in this domain. While we experimented with complex auxiliary tasks, including Siamese networks, Task-Adaptive Pretraining (TAPT), and transfer learning from Natural Language Inference (NLI), our results show these approaches fail in low-resource settings. Instead, we propose a streamlined architecture based on DeBERTa-v3-base utilising a GlossBERT-style Cross-Encoder optimised with Earth Mover’s Distance (EMD) loss. By modeling the problem as ordinal regression over a probability distribution and enriching inputs with prototypical examples, our system achieves an accuracy of 73% and Spearman correlation of 0.593, establishing a robust baseline that outperforms complex parameter-heavy approaches.
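As a rough illustration of why an Earth Mover's Distance objective suits ordinal targets, the sketch below computes the one-dimensional EMD between two distributions over the same ordered rating scale as the sum of absolute differences between their cumulative distributions. The 5-point scale and the example distributions are assumptions made for illustration; the abstract does not specify the exact loss variant or scale used by the system.

```python
from itertools import accumulate

def emd_loss(pred, target):
    """1-D Earth Mover's Distance between two distributions defined over
    the same ordered bins: sum of absolute differences of their CDFs."""
    if len(pred) != len(target):
        raise ValueError("distributions must have the same number of bins")
    cdf_pred = list(accumulate(pred))
    cdf_target = list(accumulate(target))
    return sum(abs(p - t) for p, t in zip(cdf_pred, cdf_target))

# Hypothetical 5-point plausibility scale; the target puts all mass on rating 4.
target = [0.0, 0.0, 0.0, 1.0, 0.0]
near = [0.0, 0.0, 1.0, 0.0, 0.0]  # prediction off by one rating
far = [1.0, 0.0, 0.0, 0.0, 0.0]   # prediction off by three ratings

print(emd_loss(near, target), emd_loss(far, target))
```

Unlike plain cross-entropy, which would penalise `near` and `far` identically, this distance grows with how far the predicted mass sits from the target rating.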
DUTH at SemEval-2026 Task 3: Multilingual Transformer Models for Dimensional Stance Prediction Across Tracks
Georgios Arampatzis | Avi Arampatzis
This paper presents DUTH, our system for Track A and Track B of SemEval-2026 Task 3 on Dimensional Sentiment Analysis, focusing on the Dimensional Aspect-Based Sentiment Regression (DimASR) subtask. DimASR requires predicting continuous Valence and Arousal (VA) scores for aspect terms in opinionated text and stance targets in public-issue discourse. Our approach uses a multilingual Transformer encoder fine-tuned end-to-end to jointly encode the input text and its corresponding aspect or stance target, followed by a regression head for VA prediction. We evaluate DUTH on the official multilingual and multidomain datasets and compare it against the shared-task baselines. Results show competitive performance, with improvements over the strongest official baseline in Track A and over the mBERT baseline in Track B, while yielding consistently stronger predictions for Valence than for Arousal.
DUTH at SemEval-2026 Task 9: Joint Multilingual Fine-Tuning for Online Polarization Detection
Georgios Arampatzis | Avi Arampatzis
Online polarization on social media presents substantial challenges for public discourse, content moderation, and large-scale social analytics across diverse linguistic and cultural contexts. A recent multilingual benchmark enables systematic evaluation of polarization detection across 22 languages and multiple sociopolitical events, providing a unified setting for studying socially grounded NLP under multilingual conditions. We present DUTH, a unified multilingual system for binary polarization detection based on joint fine-tuning of XLM-RoBERTa on the 22 languages of SemEval-2026 Task 9 Subtask 1. Our system uses a single shared encoder with a linear classification head and is trained jointly on the multilingual training set using mixed-precision optimization. On the official evaluation, the system achieved an average Accuracy of 0.822 and an average Macro-F1 of 0.780 across 22 languages. The results show that a simple jointly fine-tuned multilingual transformer provides a competitive and scalable baseline for online polarization detection, while still facing difficulties in implicit, sarcastic, and culturally grounded cases.
UAlberta at SemEval-2026 Task 2: Temporal Fusion Models for Predicting Affect Over Time
Duc Ho | Khanh Bui | Daniela Teodorescu | Grzegorz Kondrak
We describe our systems for the SemEval 2026 Task 2 on Predicting Variation in Emotional Valence and Arousal from Ecological Essays. To predict affect in a single instance, and for forecasting dispositional change, we use embeddings from a language model and a Recurrent Neural Network. To predict state changes from a previous timestep to the next, we use time-series forecasting. Our systems ranked first for forecasting dispositional change, and third for forecasting state change over time. We make our code publicly available.
NLP-FSDM at SemEval-2026 Task 4: Narrative Similarity via Multiple Negatives Ranking and Instruction-Based Embeddings
Abdessamad Benlahbib | Zouhir Essalmani | Achraf Boumhidi | Anass Fahfouh | Hamza Alami
The identification of narrative similarity is a complex NLP challenge that requires modeling deeper plot and thematic alignment rather than relying solely on lexical overlap. In this paper, we detail the participation of team NLP-FSDM in SemEval-2026 Task 4. Our approach utilizes the bge-large-en-v1.5 encoder. For Track A, we fine-tune it using Multiple Negatives Ranking Loss (MNRL), while for Track B we rely on the pretrained encoder to generate fixed narrative representations. We achieved an accuracy of 65.50% in Track A and 62.50% in Track B. This paper provides an extensive comparison of our results with competitive baselines and top-performing systems, analyzing the efficacy of dense encoders in low-resource narrative contexts.
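For readers unfamiliar with Multiple Negatives Ranking Loss, the following minimal sketch shows the usual in-batch formulation: each anchor is scored against every positive in the batch, and the loss is the cross-entropy of picking its own positive. The `scale` value and the toy two-dimensional embeddings are assumptions for illustration only, and the sketch assumes unit-normalised vectors so that the dot product acts as cosine similarity; it is not the team's training code.

```python
import math

def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch softmax cross-entropy: for anchor i, positives[i] is the
    match and every other positives[j] acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * dot(anchor, p) for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)

# Toy batch: two narratives with orthogonal (hypothetical) embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each anchor paired with its true match
swapped = [[0.0, 1.0], [1.0, 0.0]]  # pairs deliberately mismatched

print(mnr_loss(anchors, aligned), mnr_loss(anchors, swapped))
```

The loss is close to zero when each anchor is most similar to its own positive and large when another item in the batch scores higher.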
AI4PC-Howard University at SemEval-2026 Task 2: Fine-Tuning DistilBERT, DeBERTa and ModernBERT for Valence–Arousal Prediction and Change Estimation
Araj Shah | Utsav Shah | Saurav Aryal
We present lightweight, reproducible models for longitudinal valence–arousal (VA) prediction in the SemEval-2026 Task 2 essay corpus. Using only the official data, we enforce user-disjoint splits to prevent leakage and evaluate three settings: essay-level VA state estimation, short-horizon VA change forecasting, and long-horizon disposition change prediction. Our submitted systems use DistilBERT for essay-level regression, ModernBERT-based history modeling with a GRU and a blended previous-delta baseline for short-horizon change, and pooled DeBERTa history embeddings with a compact MLP for disposition change. On the official evaluation, across our best performing approaches, we achieve r_comp = 0.665/0.468 (valence/arousal) for Subtask 1, r = 0.597/0.413 for Subtask 2A, and r = 0.046/0.348 for Subtask 2B.
Simorgh at SemEval-2026 task 7: Region-Aware Hybrid Retrieval for Low-Resource Cultural Reasoning in Multilingual Question Answering
Hadi Bayrami Asl Tekanlou | Mahdi Bakhtiyarzadeh | Jafar Razmara
We propose a region-aware hybrid retrieval framework for culturally grounded multilingual question answering. Our system combines BM25-based lexical matching with dense semantic similarity using sentence embeddings, integrating both signals into a unified ranking function. To further prioritize culturally relevant evidence, we introduce a regional weighting heuristic that boosts documents containing explicit region-specific references. The top-ranked evidence passages are incorporated into a structured prompt and processed by a 4-bit quantized Qwen3-14B model. Instead of generating free-form text, the model selects answers deterministically using a logit-based scoring mechanism over the four multiple-choice options. This design enables efficient inference while improving cross-lingual stability, particularly in culturally explicit contexts.
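The ranking idea described above can be sketched as a weighted sum of normalised lexical and dense scores plus a bonus for passages that mention the target region. Everything concrete here is an assumption for illustration: the function names, the min-max normalisation, the `alpha` and `boost` values, and the substring test for a regional reference are not taken from the paper, and the BM25 and embedding scores are passed in as precomputed numbers.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(docs, bm25, dense, region, alpha=0.5, boost=0.2):
    """Rank docs by alpha * lexical + (1 - alpha) * dense, adding a fixed
    bonus when the passage explicitly names the region (hypothetical rule)."""
    lexical = min_max(bm25)
    semantic = min_max(dense)
    scored = []
    for doc, lex, sem in zip(docs, lexical, semantic):
        score = alpha * lex + (1 - alpha) * sem
        if region.lower() in doc.lower():
            score += boost
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]

docs = [
    "Tea is served with dates in Iran.",
    "Tea is a popular drink.",
    "Coffee culture in cities.",
]
bm25 = [2.0, 2.5, 0.5]      # made-up lexical scores
dense = [0.70, 0.72, 0.20]  # made-up cosine similarities

print(hybrid_rank(docs, bm25, dense, region="Iran"))
```

With the bonus switched off (`boost=0.0`) the generic passage ranks first; with it, the passage naming the region moves to the top, which is the behaviour the regional heuristic is meant to produce.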
lamanhnguyen at SemEval-2026 Task 2: Uncovering Lexical Bias and Momentum Lag in Longitudinal Emotion Prediction using Multi-task DeBERTa
Lam Anh Nguyen
This paper describes our system for SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal. We approached the task by fine-tuning a weighted ensemble of DeBERTa-v3-base models. Our system achieved the second-highest Valence composite correlation and ranked 5th in the overall V&A average in Subtask 1. More importantly, we provide an empirical analysis of our model’s performance on longitudinal tasks, where it exhibited significant inverse correlations. We quantify the Venting Effect, showing a systematic tendency for the model to over-index on negative lexical cues despite self-reported relief. Furthermore, we analyze the structural trade-off between Mean Absolute Error and Pearson correlation induced by smoothing techniques.
VerbaNex AI at SemEval-2026 Task 2: DeBERTa for Longitudinal Valence and Arousal Prediction
Melissa Moreno | Juan Carlos Martinez Santos | Edwin Puertas
This paper describes our submission to SemEval 2026 Subtask 1: Longitudinal Affect Assessment, which aims to predict continuous valence and arousal scores from chronologically ordered texts. We implement two regression-based configurations built on DeBERTa fine-tuning: a contextual model and a hybrid model that incorporates normalized lexical features derived from the NRC VAD lexicon. Both systems preserve temporal ordering and apply user-level data splits to ensure generalization to unseen individuals. Results show competitive performance, with stronger outcomes in valence than in arousal. The integration of lexical features does not yield consistent improvements for arousal, highlighting the difficulty of modeling emotional intensity dynamics. Error analysis indicates challenges in handling implicit emotions, pragmatic ambiguity, and subtle affective shifts over time. Overall, findings underscore the importance of combining contextual representations with structured lexical knowledge while addressing longitudinal variability in emotional activation.
We describe the PALI system submitted to SemEval-2026 Task 3 (Dimensional Aspect-Based Sentiment Analysis), which requires predicting valence–arousal (VA) scores and extracting structured sentiment tuples across multiple languages. Our final system centers on LoRA fine-tuning of Qwen3-32B using Llama-Factory, together with data conversion/cleaning, multilingual data-mixing strategies, and inference-time validation and repair. We additionally explored retrieval-based few-shot prompting with BGE-M3, but found it less effective for learning consistent VA scoring preferences. On Track A, our final system uses per-language LoRA adapters that mix all subtasks per language for a better trade-off between performance and efficiency. On the official test set, we achieve average per-language scores of 1.2071 RMSE_VA for Subtask 1 and 0.5641/0.4905 cF1 for Subtask 2/3. On the development set, we find that per-language-per-task adapters further improve extraction cF1 but are less attractive in terms of training and deployment cost. For Track B, we report results for VA prediction on five languages and two domains.
Duluth at SemEval-2026 Task 6: DeBERTa with LLM-Augmented Data for Unmasking Political Question Evasions
Ted Pedersen
This paper presents the Duluth approach to SemEval-2026 Task 6 on CLARITY: Unmasking Political Question Evasions. We address Task 1 (clarity-level classification) and Task 2 (evasion-level classification), both of which involve classifying question–answer pairs from U.S. presidential interviews using a two-level taxonomy of response clarity. Our system is based on DeBERTa-V3-base, extended with focal loss, layer-wise learning rate decay, and boolean discourse features. To address class imbalance in the training data, we augment minority classes using synthetic examples generated by Gemini 3 and Claude Sonnet 4.5. Our best configuration achieved a Macro F1 of 0.76 on the Task 1 evaluation set, placing 8th out of 40 teams. The top-ranked system (TeleAI) achieved 0.89, while the mean score across participants was 0.70. Error analysis reveals that the dominant source of misclassification is confusion between Ambivalent and Clear Reply responses, a pattern that mirrors disagreements among human annotators. Our findings demonstrate that LLM-based data augmentation can meaningfully improve minority-class recall on nuanced political discourse tasks.
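Two of the training ingredients named above have compact standard definitions, sketched here for orientation. Focal loss scales the cross-entropy of an example by (1 - p_t)^gamma so that confidently correct examples contribute less, and layer-wise learning rate decay gives lower encoder layers geometrically smaller learning rates than the top layer. The values of `gamma`, `base_lr`, `decay`, and the layer count below are illustrative assumptions, not the configuration reported by the authors.

```python
import math

def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t), where p_t
    is the predicted probability of the true class."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def layerwise_lrs(base_lr, decay, num_layers):
    """Learning rate per layer, bottom to top; the top layer keeps base_lr
    and each layer below it is multiplied by `decay` once more."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

easy = [0.9, 0.1]  # confident and correct (true class 0)
hard = [0.3, 0.7]  # wrong-leaning prediction for true class 0

print(focal_loss(easy, 0), focal_loss(hard, 0))
print(layerwise_lrs(base_lr=2e-5, decay=0.9, num_layers=4))
```

Setting `gamma=0` recovers ordinary cross-entropy, which makes it easy to check how strongly the focusing term down-weights easy examples.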
HCMUSDroneBoys at SemEval-2026 Task 11: Asymmetric Counterfactual Debiasing and Rank-Sensitive Logical Invariance Adaptation for Syllogistic Reasoning
Nguyen Tran | Duy Minh Dao Sy | Trung Kiet Huynh | Phu Hoa Pham | Phu Quy Nguyen Lam
This paper describes our system for SemEval-2026 Task 11, Subtask 1: binary classification of syllogistic validity in English. The main challenge is the content effect, where language models confuse formal logical validity with how plausible the argument sounds. We propose three techniques that work together to separate logical form from semantic content: (1) Structure-Disentangled Prompting (SDP), which breaks syllogisms into premise-conclusion triples and uses a logic-first instruction template; (2) Asymmetric Counterfactual Debiasing (ACD), a data augmentation method that only generates valid-to-invalid counterfactual pairs, taking advantage of an asymmetry in validity composition to avoid label noise; and (3) Rank-Sensitive Logical Invariance Adaptation (RLIA), where we find that low-rank QLoRA adapters cannot simultaneously learn classification and suppress content-correlated shortcuts, and solve this by increasing adapter rank. Built on Qwen2.5-14B-Instruct, our system achieved a perfect Combined Score of 100.0 on the SemEval-2026 Task 11 Subtask 1 benchmark.
YNU-HPCC at SemEval-2026 Task 1: Constraint-Aware In-Context Learning for Multilingual Humor Generation
Xulong Zhang | Jin Wang | Xuejie Zhang
This paper describes the system developed by the YNU-HPCC team for SemEval-2026 Task 1 (Humor Generation). The task aims to generate humorous texts from given news headlines or from two unrelated words. The core challenge lies in enabling Large Language Models (LLMs) to understand human humor and align with specific humorous styles. We investigated two approaches: fine-tuning with Proximal Policy Optimization (PPO) and in-context learning with LLMs. We also employed Qwen-Max to evaluate the quality of the generated texts. In the PPO experiments, we constructed a hybrid reward model to align with humor. For our final submission based on LLMs, we used multiple advanced LLMs, along with customized few-shot prompts and a small set of gold samples, to effectively guide the models in generating jokes that resonate with human humor. Experimental results show that our system achieves competitive performance, ranking 4th in the English track, 2nd in the Chinese track, and 2nd in the Spanish track.
Perspicere at SemEval-2026 Task 2: Modeling Longitudinal Valence and Arousal via Dense Embeddings and Agentic Reasoning
Kamyar Moradian Zehab | Mohammad Sadegh Poulaei | Nasser Mozayani
This paper presents our system for SemEval 2026 Task 2 (Subtask 1), modeling affect assessment as a longitudinal trajectory. We evaluate a tripartite affective framework of escalating contextual complexity, spanning zero-context feature extraction, latent temporal modeling via LSTM, and explicit semantic reasoning via the Teacher-Guided Clinical Reasoning Agent utilizing in-context learning. Our results show that robust static extraction outperforms explicit sequence modeling. Specifically, Matryoshka-distilled embeddings (Jasper) paired with XGBoost provided the best balance of speed and accuracy when utilizing the full training corpus (Valence composite r = 0.654, a 17.4% improvement compared with the baseline), mitigating the severe overfitting observed on partitions of the dataset. Additionally, we uncover a distinct agentic advantage: although the reasoning agent trailed mathematical regressors in tracking high-frequency fluctuations, its SOTA psychological profiling yielded the highest Between-User Valence correlation (r = 0.725), demonstrating its efficacy in user-level affective profiling. Finally, a persistent "arousal bottleneck" confirms the limitations of text-only modeling for physiological activation.
McMaster NLP at SemEval-2026 Task 2: A Lightweight Multi-Feature System for Predicting Emotional Valence and Arousal over Time
Hongyi Zhang | Daniel Hu | Allison Lahnala
We present a lightweight, feature-based regression system for predicting valence (pleasantness) and arousal (activation) from longitudinal language data. The language data ranges from longer free-form ecological essays to short affect-word, organized by user and time, reflecting natural variation in affective expression and experience. Our approach combines four complementary signals: (i) sentence-level semantic embeddings, (ii) psycholinguistic category features capturing affect- and function-related word usage, (iii) similarity measures between the language data and archetypal sentences, and (iv) trainable user embeddings to account for between-user differences. The resulting feature vector is passed to a multi-layer perceptron trained to jointly predict valence and arousal. Our design provides a strong and interpretable baseline by making it possible to isolate the contribution of semantic, psycholinguistic, similarity, and user-specific signals. We further analyze our model’s predictions to identify which feature groups are most informative and where errors are concentrated across users and input types.
YNU-HPCC at SemEval-2026 Task 9: Hybrid Augmentation and Regularization Strategies for Multilingual Polarization Type Classification
Di Bao | Jin Wang | Xuejie Zhang
This paper introduces a system based on fine-tuned pretrained language models, which is constructed for SemEval 2026 Task 9: Multilingual Polarization Type Classification. The task aims to perform multi-label polarization classification on texts covering 22 languages, identifying five types of polarization: political, racial/ethnic, religious, gender/sexual, and others. The main challenges of the task lie in handling uneven data distribution across languages, extreme class imbalance, and the complexity of cross-lingual semantic understanding. To address these challenges, a training framework integrating hybrid augmentation and multi-strategy regularization is proposed. Based on XLM-RoBERTa-large, the framework combines feature-space Mixup augmentation, an asymmetric loss function, adversarial training, and exponential moving average. Multi-label decisions are made through dynamic threshold optimization. Experimental results show that the proposed method achieves a macro-F1 score of 0.48 on the validation set, effectively improving classification performance and generalization capability in multilingual and imbalanced scenarios.
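One simple way to realise threshold optimisation for imbalanced multi-label outputs is to search, separately for each label, for the decision threshold that maximises F1 on validation data instead of fixing it at 0.5. The sketch below shows this for a single label. The grid, the toy labels, and the probabilities are assumptions for illustration, and the abstract does not state that the authors' dynamic threshold optimisation works exactly this way.

```python
def f1(y_true, y_pred):
    """Binary F1 from 0/1 lists; returns 0.0 when nothing is positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def binarize(probs, threshold):
    """Turn probabilities into 0/1 decisions at the given threshold."""
    return [int(p >= threshold) for p in probs]

def best_threshold(y_true, probs, grid=None):
    """Grid-search the threshold that maximises F1 for one label."""
    grid = grid or [i / 20 for i in range(1, 20)]
    return max(grid, key=lambda t: f1(y_true, binarize(probs, t)))

# Toy validation data for one rare label whose scores sit below 0.5.
y_true = [1, 1, 0, 0, 0, 1]
probs = [0.35, 0.30, 0.10, 0.20, 0.05, 0.60]

t = best_threshold(y_true, probs)
print(t, f1(y_true, binarize(probs, 0.5)), f1(y_true, binarize(probs, t)))
```

In a multi-label setting the same search would be repeated per polarization type, so that rare labels with systematically low scores can receive lower thresholds.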
Paradise at SemEval-2026 Task 12: Leveraging Instruction-Tuned Large Language Models with Chain-of-Thought Prompting for Abductive Event Reasoning
Dhruv Goyal | Ishita Gupta | Jatin Bedi
We present Paradise, our system for SemEval-2026 Task 12: Abductive Event Reasoning, which identifies plausible direct causes of real-world English-language events using retrieved contextual documents. Our approach employs Qwen2.5-7B-Instruct, a 7-billion-parameter instruction-tuned language model combined with carefully engineered chain-of-thought prompting, requiring no task-specific fine-tuning or training-data supervision (prompt components were selected using the development set). The system achieves a score of 0.79 on the official 612-instance test set by integrating explicit causal-inference rules, 4,000-character document context windows, and greedy decoding. Analysis reveals that conservative prediction patterns, 87.1% single-label and 36.9% Option D, effectively exploit the asymmetric scoring metric. Ablation studies confirm that document context contributes +6.4 points, chain-of-thought reasoning +5.3 points, and explicit causal rules +3.1 points to development performance. Our code is publicly available at https://github.com/DhruvGoyal404/semeval2026-task12.
Paradise at SemEval-2026 Task 5: On the Limitations of Surface-Level Features for Graded Word Sense Plausibility Prediction
Dhruv Goyal | Ishita Gupta | Jatin Bedi
This paper introduces a simple approach for predicting how plausible a word sense is in short narratives where meaning is ambiguous. We use 13 hand-crafted features, including text statistics, word-level similarity computed using basic set-based comparisons, and measures of annotator disagreement. Five diverse and largely independent traditional machine learning models are combined using a weighted ensemble with minimal tuning. Despite theoretical grounding in classical disambiguation methods, our system achieves essentially random performance, with Spearman correlation (ρ) of −0.038 and accuracy within standard deviation of 0.542 on the official test set. This result demonstrates that surface-level lexical features, while interpretable, are insufficient for graded sense plausibility prediction without deep semantic representations. By selecting features inspired by classical word sense disambiguation techniques and incorporating signals derived from human disagreement, our model produces plausibility predictions that are largely interpretable. This negative result provides important baselines and insights for future work on graded word sense disambiguation.
ES4MLL at SemEval-2026 Task 2: Set Attention Aggregation and Recurrent Temporal Modeling for Longitudinal Affect Prediction
Andrea Lolli | Chiara Lunazzi | Riccardo Coppola | Flavio Giobergia
Longitudinal modelling of affect from text requires capturing both linguistic content and temporal emotional dynamics. SemEval-2026 Task 2 introduces a dataset of essays and feeling words annotated with self-reported valence and arousal scores. In this work, we propose a neural architecture that combines pretrained Transformer encoders with temporal sequence modelling to predict continuous valence and arousal over user-specific timelines. Individual texts are encoded using a Transformer-based language model and aggregated through attention-based pooling before being processed by recurrent layers to capture longitudinal dependencies. To adapt pretrained representations under limited data conditions, we explore parameter-efficient fine-tuning strategies. We make the code available at https://github.com/AndreaLolli2912/SemEval2026-EmoVA.
TTLab at SemEval-2026 Task 10: Transformer-based Approaches for Psycholinguistic Conspiracy Detection in Social Media Discourse
Samuel Richter | Mounika Marreddy | Alexander Mehler
Online platforms increasingly host conspiracy narratives that shape public debate, reduce trust in institutions, and contribute to polarization, highlighting the need for reliable automatic detection systems. In this paper, we participate in SemEval-2026 Task 10 (PsyCoMark), focusing on conspiracy detection in Reddit conversations using transformer-based models. We evaluate four approaches: raw text, structured psycholinguistic markers, a combined representation, and a stacking ensemble. Our results show that marker-based representations outperform text-only models, and that ensembling further improves robustness. These findings demonstrate the value of incorporating structured psychological cues for scalable conspiracy detection.
Our system for SemEval-2026 Task 1 Subtask A addresses constrained text-based humor generation in English. The approach relies on structured prompt engineering using a GPT-4–class large language model in a zero-shot setting without task-specific fine-tuning. Each input, consisting of either mandatory word pairs or a news headline, is embedded into a fixed instruction template enforcing strict stylistic and structural constraints. The system ensures single-sentence outputs between 8–12 words, adopts a dry and deadpan tone, and incorporates subtle expectation shifts while avoiding exaggerated punchlines or unsafe content. Deterministic decoding guarantees replicability, and an automatic validation step enforces compliance with official submission requirements. Experimental results show that structured prompting significantly improves stylistic alignment compared to unconstrained generation. The system demonstrates that controlled humor generation can be achieved through constraint-based prompt design without additional training.
YNU-HPCC at SemEval-2026 Task 10: Pretrained DistilBERT Models for Conspiracy Marker Extraction and Detection
Junpei Chen | You Zhang | Jin Wang | Dan Xu | Xuejie Zhang
In this paper, we present our submission to the SemEval-2026 Psycholinguistic Conspiracy Shared Task (Task 10), which consists of two tasks: conspiracy marker extraction and conspiracy detection. For conspiracy marker extraction, we formulate the problem as a token classification task and fine-tune pretrained language models, achieving performance above the official baseline and ranking 6th on the final leaderboard. For conspiracy detection, we apply data preprocessing, regularization, and ensemble inference strategies, resulting in improvements over the baseline and a 10th-place ranking. Overall, our results demonstrate the effectiveness of pretrained language models for both tasks.
SemEval 2026 Task 5 asks us to provide a program to try to match the human ratings of sense-appropriateness of a particular word in a series of very structured, very short stories. Our system associates a fixed list of 50 words with each WordNet synset, and computes several scores for each of the phrases in the story, to determine how closely the phrase matches the wordlist. We received near-chance results, in spite of several different approaches to building and employing sets of word-lists. The stories in this dataset are designed to be ambiguous, and every story contains words associated with at least two senses of the target word. We now believe that our system’s approach is inappropriate for this dataset.
zhangpeng at SemEval-2026 Task 10: PsyCoMark - Psycholinguistic Conspiracy Marker Extraction and Detection
Zhang Peng | Lu Gehao
We describe our system for SemEval-2026 Task 10 on psycholinguistic conspiracy marker extraction and conspiracy detection from English texts. The shared task consists of two subtasks: (1) extracting conspiracy-related markers (actor, action, effect, victim, and evidence), evaluated using an overlap-based macro F1-score, and (2) detecting conspiracy content as a binary text classification problem evaluated using macro-averaged F1-score. Our approach relies on fine-tuning pre-trained transformer encoders, including multilingual DistilBERT variants and DeBERTa-v3, without using external corpora or data augmentation techniques. Experimental results show that our best models achieve a macro-F1 score of 0.1476 for Subtask 1 and a Weighted-F1 score of 0.7267 for Subtask 2. These results show that simple fine-tuning of pre-trained models provides a strong baseline for both marker extraction and conspiracy detection.
AIvengers at SemEval-2026 Task 9: Utilizing Language Specific Encoders for Multilingual Text Classification
Boon Elschenbroich | Lars Britz
Polarizing language has evolved from a social media phenomenon into a pervasive feature of public and everyday discourse across cultures and geographies. This is not limited to certain countries but is a worldwide trend. As we will show, detecting polarization, its type, and its manifestation is not a simple task for a single ML model; it requires different approaches depending on the language and culture. In this paper, we present the best methods that we found for each language in all three subtasks of the SemEval-2026 Task 9 multilingual text classification challenge. We achieved the best results with language-specific pre-trained BERT and RoBERTa models rather than with a generic multilingual model. Our approach secured a high to medium rank in all subtasks and languages.
SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation
Deshan Sumanathilaka | Nicholas Micallef | Julian Hough | Saman Jayasinghe
Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large Language Models (LLMs) can effectively disambiguate, their practical applicability in real-world narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap by introducing a task that predicts the human-perceived plausibility of a word sense within a short story. In this work, we propose an LLM-based framework for plausibility scoring of homonymous word senses in narrative texts using a structured reasoning mechanism. We examine the impact of fine-tuning low-parameter LLMs with diverse reasoning strategies, alongside dynamic few-shot prompting for large-parameter models, on accurate sense identification and plausibility estimation. Our results show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human-like plausibility judgments. Furthermore, model ensembling slightly improves performance, better simulating the agreement patterns of five human annotators compared to single-model predictions.
BAHAHA at SemEval-2026 Task 1: Benchmarking-Aware Humor Authoring with Hybrid Assessment and Adaptation
Utsav Arora | Andrew Hoblitzell
This paper describes the BAHAHA system for SemEval-2026 Task 1: MWAHAHA, which requires generating original jokes given either a news headline or a pair of rare words. Our approach uses a generate-then-rank pipeline, combining multi-style candidate generation via comedian-inspired few-shot prompting. We perform quality assessment from a smaller model fine-tuned on synthetic rating data from the generation model. Specifically, we produce up to 50 candidates per input across 15 stylistic templates and select outputs through a mixed-initiative interface that combines automated ranking with human judgment. There were 305 participants and 180 submissions in the contest. Our system ranks 2nd on Subtask A Chinese and 5th on Subtasks B1 and B2. The system generates jokes natively in each language rather than through translation.
dangphuduy at SemEval-2026 Task 10: Span-based Conspiracy Marker Extraction and Emotion-Aware Detection via Gated Fusion
Phu Duy Dang
Conspiracy theories on social media pose significant societal risks, making it essential to detect both conspiracy-related content and the textual spans that serve as conspiracy markers. In this work, we propose two effective methods to address these challenges. For marker extraction, we develop a span-based sliding window framework that improves efficiency and accuracy by focusing on localized context. In addition, inspired by the distinctive emotional patterns in conspiracy texts, we design a dynamic gating mechanism to integrate emotional and semantic representations. We evaluate our methods on the SemEval 2026 Task 10, where our team (dangphuduy) achieved competitive results, ranking 4th in Task 1 (Span Extraction) and 3rd in Task 2 (Conspiracy Detection). Experimental results demonstrate that both proposed methods significantly enhance model performance.
Pixel Phantoms at SemEval-2026 Task 3: Language-Specific Transformer Regression for Dimensional Aspect-Based Sentiment Analysis
Jithu Morrison S | Abisha Rose S
Aspect-Based Sentiment Analysis (ABSA) has traditionally relied on discrete polarity labels (positive, negative, or neutral) which fail to capture the continuous, multidimensional nature of human emotion. SemEval-2026 Task 3, Dimensional Aspect-Based Sentiment Analysis (DimABSA), addresses this limitation by requiring systems to predict continuous Valence (pleasantness) and Arousal (intensity) scores on a 1–9 scale for specific aspect terms in text, across 15 language–domain combinations in two tracks. Prior approaches to multilingual ABSA have largely depended on single generic multilingual encoders applied uniformly across languages, ignoring language-specific linguistic structures. The Pixel Phantoms system takes a language-aware strategy, selecting dedicated language-specific pre-trained transformer models for each language, including cl-tohoku/bert-base-japanese-v3 for Japanese, DeepPavlov/rubert-base-cased for Russian, bert-base-chinese for Chinese, and a Davlan Swahili mBERT variant for Swahili, and falling back to xlm-roberta-base for morphologically complex low-resource languages such as Tatar and Ukrainian. All models share a common regression architecture: a dual-pooling head combining CLS and mean-pooled representations, trained with a composite MSE + MAE loss and aspect-prompted input formatting. We participated in both Track A (10 combinations) and Track B (5 combinations), with our strongest result in Japanese Hotel (rank 13/21, RMSE 0.7297) and competitive performance in Chinese Restaurant (RMSE 0.9823 vs. Baseline Kimi-K2 Thinking 1.8959). We also analyze failure modes in low-resource languages and domain-shifted settings, highlighting where multilingual transfer remains brittle. Overall, the results indicate that language-specific encoders deliver consistent gains over generic multilingual baselines in dimensional sentiment regression.
Gradient Descenders at SemEval-2026 Task 9: Data-Centric Counterfactual Augmentation for Multi-Label Hate Speech Detection
Tran Nhan | Dang Thin
In this paper, we describe the Gradient Descenders submission to SemEval-2026 Task 9 Subtask 2: Multi-Label Hate Speech Detection. Existing Transformer-based approaches often exhibit degraded performance on this task due to severe class imbalance and complex class intersectionality, leading to the learning of spurious correlations. To counteract this, we introduce a novel, data-centric counterfactual augmentation pipeline. We employ Large Language Models (LLMs) as semantic generators to synthesize diverse, targeted training samples via three distinct prompting strategies: Additive Label-Flipping (Attribute Injection), Context Decoupling, and Cross-Domain Identity Substitution. Fine-tuning a RoBERTa classifier on this augmented corpus significantly improves the model’s sensitivity to minority classes. Ultimately, our system achieves a Macro-F1 score of 44.15% on the official test set, highlighting the efficacy of targeted LLM-based augmentation in highly imbalanced, multi-label environments.
SemEval-2026 Task 12: Knowledge Graph with hyperbolic embedding in Abductive Event Reasoning
Mingkai Wang | Varun Ojha | Huizhi Liang
This task introduces Abductive Event Reasoning (AER), a novel shared task, to investigate the ability of Large Language Models (LLMs) to reason about the causality of real-world events. More specifically, a data set consisting of different topics and choices is introduced, and we need to enable the model to select the best options for the given event. Three methods are separately introduced to explore the question, including the traditional natural language processing (NLP) method (DeBERTa), the enhanced knowledge graph (KG), and the KG embedded in hyperbolic space.
The system integrates a generative Large Language Model (Llama-3 8B, fine-tuned via LoRA) with a dual-expert bidirectional cross-encoder (DeBERTa-v3-large) optimized for both semantic similarity and Natural Language Inference (NLI). By aggregating these complementary models, the system effectively captures complex contextual dependencies. In the official test set, our architecture ranked 22nd out of 79 systems, achieving a Spearman Rank Correlation of 0.71 and an accuracy within the standard deviation of 82.04%.
SMASH at SemEval-2026 Task 9: Detecting Multilingual Polarisation with Encoder Ensembles and Calibrated Decision Thresholds
Zahra Bokaei | Alessandra Terranova | Yi Zheng | Tom Bidewell | Bjorn Ross
This paper describes the SMASH submission to SemEval-2026 Task 9 on multilingual, multicultural, and multi-event polarisation detection. The task comprises (i) binary polarisation detection, (ii) multi-label classification of polarisation types, and (iii) multi-label identification of polarisation manifestations across all available languages. We propose a language-adaptive ensemble framework combining monolingual and multilingual encoder-only transformers, together with a principled out-of-fold (OOF) threshold tuning strategy. Instead of relying on fixed probability thresholds, we jointly tune ensemble weights and class-wise decision thresholds to directly optimise macro-F1 under the official evaluation metric. Our experiments show that (1) monolingual encoders dominate in several high-resource languages but benefit from complementary multilingual signals, (2) no single multilingual backbone universally outperforms others across languages and subtasks, and (3) language-specific class threshold tuning substantially improves performance due to large cross-lingual variation in class distributions. Our results demonstrate that careful logit-level ensembling and threshold tuning provide strong performance for multilingual, imbalanced, multi-label polarisation detection. Across 22 evaluation languages, SMASH ranks among the top three systems in a substantial number of language–subtask pairs. Specifically, it ranks in the top three for 5 languages in Subtask 1, 14 languages in Subtask 2, and 16 languages in Subtask 3, demonstrating strong and consistent performance across diverse languages and tasks. Our system achieves average macro-F1 scores of 0.81, 0.62, and 0.53 for Subtasks 1, 2, and 3, respectively.
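As an illustration of the class-wise threshold tuning mentioned in the SMASH abstract, the following is a minimal sketch, not the authors' implementation. It assumes the ensemble probabilities are already fixed (the joint tuning of ensemble weights is omitted), and the function names, the threshold grid, and the toy data are hypothetical. Because multi-label macro-F1 is the mean of per-class F1 scores, each class threshold can be searched independently.

```python
from typing import List, Optional, Sequence


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for one binary label column; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(
    oof_probs: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    grid: Optional[Sequence[float]] = None,
) -> List[float]:
    """Pick, per class, the grid threshold with the best out-of-fold F1."""
    if grid is None:
        grid = [i / 20 for i in range(1, 20)]  # assumed grid: 0.05 .. 0.95
    thresholds = []
    for c in range(len(labels[0])):
        y_true = [row[c] for row in labels]
        probs = [row[c] for row in oof_probs]
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            f1 = f1_binary(y_true, [int(p >= t) for p in probs])
            if f1 > best_f1:  # strict: keep the lowest threshold on ties
                best_t, best_f1 = t, f1
        thresholds.append(best_t)
    return thresholds


# Toy out-of-fold predictions: the second class is systematically
# under-confident, so a fixed 0.5 threshold would miss every positive.
toy_probs = [[0.9, 0.30], [0.8, 0.25], [0.2, 0.10], [0.1, 0.05]]
toy_labels = [[1, 1], [1, 1], [0, 0], [0, 0]]
tuned = tune_thresholds(toy_probs, toy_labels)
```

In this toy case the tuned threshold for the second class falls well below 0.5, which is the kind of class-distribution effect the abstract attributes the gains to.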
Lattice at SemEval-2026 Task 1: Why did the prompt engineer break up with their LLM? Because zero-shot was zero-fun.
Mathieu Dehouck | Olga Seminck | Marine Delaborde | Yoann Dupont | Noé Durandard
This paper describes the contribution of the Lattice Team to the humor generation MWAHAHA SemEval shared task on the English data set for subtask A. During the development phase, we experimented with two different approaches, but after a quick comparison of the outputs, it turned out that one was clearly more successful than the other. The winning strategy can be seen as consisting of two phases: first, we used a few-shot framework to let Deepseek-R1 32B generate multiple jokes based on the input (headlines and word pairs). Second, we set up a voting protocol for Llama-3.1 8B to rank the generated jokes and find the funniest one. The other strategy also consisted of two phases: first, we generate many more jokes in a zero-shot way with lighter, faster models, and then we turn back to ranking the generated jokes, but since we have about ten times more jokes in this second setting, we follow a knockout tournament procedure in order to find the best jokes. Our Deepseek-R1 based model is one of the nine systems that shared first place on the English data set, which received a total of 32 valid submissions.
Comhis at SemEval-2026 Task 4: Embedding-Space Adaptation and LLM-Assisted Inference for Narrative Similarity
Ke Shu | Eetu Mäkelä | Mikko Tolonen
We present a two-stage system for the SemEval Narrative Similarity task that separates representation learning from comparative decision making. In Track B, we adapt a frozen large-scale embedding model using a lightweight projection layer trained with a triplet objective and hard example mining, producing a task-specific similarity space. In Track A, similarity scores derived from the adapted embedding space are incorporated into a large language model, which performs the final binary decision. On the official test set, our system achieves 0.68 accuracy on Track A and 0.66 on Track B, clearly outperforming the provided baselines and ranking 20th out of 44 teams on Track A and 10th out of 27 teams on Track B. These results demonstrate that efficient embedding adaptation combined with embedding-informed LLM reasoning is effective for modeling high-level narrative similarity.
FLANS at SemEval-2026 Task 7: RAG with Open-Sourced Smaller LLMs for Everyday Knowledge Across Diverse Languages and Cultures
Liliia Bogdanova | Shiran Sun | Lifeng Han | Natalia Amat-Lefort | Flor Miriam Plaza-del-Arco
This system paper describes our participation in the SemEval-2026 Task 7 “Everyday Knowledge Across Diverse Languages and Cultures”. We participated in two subtasks: Track 1, Short Answer Questions (SAQ), and Track 2, Multiple-Choice Questions (MCQ). The method we used is retrieval-augmented generation (RAG) with open-sourced smaller LLMs (OS-sLLMs). To better adapt to this shared task, we created our own culturally aware knowledge bases (CulKBs) by extracting Wikipedia content using keyword lists we prepared. We extracted both culturally-aware wiki-text and country-specific wiki-summary. In addition to the local CulKBs, we also have one system integrating live online search output via DuckDuckGo. Towards better privacy and sustainability, we aimed to deploy smaller LLMs (sLLMs) that are open-sourced on the Ollama platform. We share the prompts we developed using refinement techniques and report the learning curve of such prompts. The tested languages are English, Spanish, and Chinese for both tracks. Our resources and code are shared via https://github.com/aaronlifenghan/FLANS-2026
PolDeck at SemEval-2026 Task 9: Multilingual Online Polarization Detection via Hybrid Model Ensembling and Data Augmentation
Ben Grandy | Daniel Khir
In this paper, we address SemEval 2026 Task 9: Multilingual Online Polarization Detection. We present our hybrid ensemble framework, integrating few-shot prompting with Qwen3-30B, a native multilingual XLM-R encoder, and a translation-augmented DeBERTa encoder. To mitigate label imbalance, we implement a multi-stage augmentation pipeline leveraging LLMs for synthetic paraphrasing and cross-lingual translation. Our system ranked in the Top 10 on the English and German leaderboards, proving that integrating both high-capacity monolingual models and flexible multilingual models in a holistic system is a viable method for detecting online polarization. Our code is available on GitHub.
CUCLASIC at SemEval-2026 Task 5: LLM Prompting Strategies for Rating Ambiguous Word Senses
Federico Ortega Riba | Jasper Wilkerson | Kelsey Lafreniere Adams
Word sense disambiguation has been a foundational task in computational semantics since the 1990s, but remains an unsolved problem when it comes to bridging human and computational evaluation of ambiguity. The SemEval-2026 Task 5 attempts to address this gap. We test six Large Language Models (LLMs) from the Llama and Gemini families in order to evaluate LLMs’ ratings of ambiguous textual excerpts, experimenting with zero- and few-shot variants of prompts and analyzing how simple linguistic cues improve performance. We propose a methodology of eliciting human-like ratings from language models by using examples with low and high standard deviations between human ratings. We further evaluate and compare the prediction patterns of different models and how they align with the human generated ratings. Our best model (Gemini 3-Flash) achieves a 75% score combining Spearman correlation and accuracy within one standard deviation.
BBgame at SemEval-2026 Task 12: Small Lanugage Model Fintuning for Abductive Event Reasoning task
Shu Li | Huizhi(elly) Liang
We introduce a three-stage training framework for abductive event reasoning (AER). The task dataset was decomposed into three subsets: causal judgment, cause generation, and multiple choice answering (MCQA). Abductive reasoning requires understanding complex causal relationships between events. However, small language models typically struggle due to the multi-step inference required. Our approach combines supervised fine-tuning with group relative policy optimization (GRPO) to strengthen the reasoning capabilities of a 0.5B-parameter model. On the SemEval-2026 Task 12 development set, our Casual-Qwen 0.5B model achieves 64.75%, outperforming Qwen2.5:0.5b (63.78%), an absolute improvement reported as 0.0975%. Our ablation study reveals that binary causal judgment, rather than cause generation or direct MCQA training, is the key skill for the AER task, with more complex stages significantly underperforming due to task misalignment or task complexity.
VerbaNexAI at SemEval-2026 Task 7: Integrating Web Snippets and RAG for the Evaluation of Multilingual Cultural Knowledge in LLMs
Danileth Almanza | Jairo Serrano | Edwin Puertas | Juan Carlos Martinez Santos
In multilingual and multicultural contexts, LLMs require contextualization mechanisms to generate culturally coherent responses. In this sense, this study presents a LLaMA-based approach to answer short cultural questions in different languages within Task 7 of SemEval-2026 (Track 1: SAQ), without access to official training data. The system integrates controlled synthetic data generation, evidence retrieval through web snippets, and a Retrieval-Augmented Generation (RAG) framework with few-shot learning. BLEnD is used solely as a thematic guide, ensuring semantic independence. During development, the LLaMA-3.1-8B model achieved 38.51% global accuracy, while LLaMA-3.2-1B obtained 15.54%. In large-scale evaluation (30,500 instances), the 1B model achieved 16.69%, maintaining stability after prompt optimization. The results demonstrate that contextual retrieval improves multilingual cultural knowledge evaluation and highlight the importance of pipeline design and model capacity.
KDW at SemEval-2026 Task 12: Logic-Driven Distillation with Knowledge Graphs for Efficient Abductive Reasoning
Sihan Zhu | Hongjie Wu | Xinyan Xu
Large language models (LLMs) such as GPT-4 and Gemini show strong reasoning ability but incur substantial computational cost in abductive reasoning settings. We present our system for "SemEval-2026 Task 12 — Abductive Event Reasoning: Towards Real-World Event Causal Inference for Large Language Models", which integrates knowledge graph (KG) evidence extraction with knowledge distillation to transfer structured reasoning from a large teacher model to a compact student model. Our approach ranks 8th in the shared task while achieving performance comparable to frontier LLMs at a fraction of the inference cost.
kirito at SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis via Sentence Structure Parsing Preprocessing and Prompt-Enhanced Instruction Tuning
Shuangjin Hu
Dimensional Aspect-Based Sentiment Analysis (DimABSA) integrates fine-grained aspect extraction with continuous Valence–Arousal (VA) regression, posing unique challenges for fine-grained opinion mining. This paper presents our system for SemEval-2026 Task 3, with task-aligned strategies for three heterogeneous subtasks. For the DimASR task, we frame dimensional sentiment prediction as a supervised regression problem, paired with Low-Rank Adaptation (LoRA)-based parameter-efficient fine-tuning and a deep nonlinear regression head. For DimASTE and DimASQP tasks, we propose a lightweight sentence structure parsing preprocessing module, combined with prompt-enhanced instruction tuning for unified structured generation of aspect elements and VA scores. Experimental results on the official English test sets show that our system outperforms both official baselines across most settings, with syntax-guided prompting effectively improving aspect-opinion alignment and the dedicated regression head enhancing continuous sentiment modeling stability.
YNU-NLP at SemEval-2026 Task 11: A Neuro-Symbolic Approach with Reflexion Mechanism Disentangling Content and Formal Reasoning in Language Models
Yu Wang | You Zhang | Hao Zhang | Dan Xu
This paper describes our systems for SemEval-2026 Task 11, Disentangling Content and Formal Reasoning in Language Models. We participated in all four subtasks, addressing the Content Effect, a phenomenon where models rely on real-world plausibility rather than logical validity. Existing methods, such as standard Chain-of-Thought (CoT) prompting or single-task Supervised Fine-Tuning (SFT), often struggle to completely decouple content from reasoning due to the inherent probabilistic biases in pre-trained models. To address these limitations, a hybrid neuro-symbolic framework based on the Qwen2.5-14B architecture is proposed, integrating multi-task instruction tuning with a robust neuro-symbolic pipeline. The principal innovation lies in the deployment of a Reflexion mechanism coupled with formal verification: natural language arguments are parsed into First-Order Logic (FOL) and subsequently verified by the Z3 Theorem Prover. Parsing anomalies are automatically rectified through an iterative self-correction module. The proposed system ranked 1st in Subtasks 1 & 2, 2nd in Subtask 4, and 9th in Subtask 3, validating its ability to decouple logical validity from content plausibility.
Duluth at SemEval-2026 Task 4: A Hybrid Approach to Narrative Similarity using Bi-Encoder Embeddings with Cross-Encoder Tie breaking using Learned Weights
Maxwell Bevers | Aidan Carlson | Ted Pedersen
We present a hybrid system for SemEval-2026 Task 4 on Narrative Similarity. Our approach decomposes the stories into four narrative components: theme, plot, emotion, and outcome. Each component is then encoded using a bi-encoder (all-mpnet-base-v2), and cosine similarities are combined through a learned pairwise ranking model. When similarity scores between candidate stories fall within a small margin of error, a cross-encoder (ms-marco-MiniLM-L-6-v2) is used as a tie-breaker. Our final system achieves 58.5% accuracy on the official test set, placing us at 39th overall. Error analysis reveals that the system struggles with complex themes, multiple protagonists, and contrasting outcomes.
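The decision rule in the Duluth abstract (per-component cosine similarities, a combined score, and a tie-breaker inside a small margin) can be sketched as follows. This is only an illustration under stated assumptions: a fixed weighted sum stands in for the learned pairwise ranking model, the embeddings are toy vectors rather than all-mpnet-base-v2 outputs, and `tie_breaker` is a hypothetical callable standing in for the cross-encoder.

```python
import math
from typing import Callable, Dict, Sequence

COMPONENTS = ("theme", "plot", "emotion", "outcome")
Story = Dict[str, Sequence[float]]  # component name -> embedding vector


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_score(anchor: Story, candidate: Story,
                   weights: Dict[str, float]) -> float:
    """Weighted sum of per-component cosines (stand-in for a learned ranker)."""
    return sum(weights[c] * cosine(anchor[c], candidate[c]) for c in COMPONENTS)


def choose_candidate(anchor: Story, cand_a: Story, cand_b: Story,
                     weights: Dict[str, float], margin: float,
                     tie_breaker: Callable[[Story, Story, Story], str]) -> str:
    """Return "A" or "B"; defer to tie_breaker when scores are too close."""
    score_a = combined_score(anchor, cand_a, weights)
    score_b = combined_score(anchor, cand_b, weights)
    if abs(score_a - score_b) < margin:
        return tie_breaker(anchor, cand_a, cand_b)
    return "A" if score_a > score_b else "B"
```

Keeping the tie-breaker as a callable means the more expensive second model is only invoked for the close cases, which matches the role the abstract gives the cross-encoder.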
This paper describes our system designed for SemEval-2026 Task 10: PsyCoMark, Subtask 2: Conspiracy Detection. We proposed a two-stage approach that leverages large-scale pre-trained models and a fine-tuned smaller model to detect conspiracy theories in text. In the first stage, we utilize a large model to test all the test samples and filter out those that are clearly unrelated to conspiracy theories. For the remaining samples, we apply a retrieval-enhanced custom prompt strategy combined with the Roberta-Large model in the second stage. This allows us to fine-tune the model with weighted predictions based on relevant retrieved information, enhancing detection accuracy. Our system achieved first place on the leaderboard, with an impressive F1 Score of 0.8874. We also present a brief analysis of the effectiveness of the methods used, including the advantages and limitations of large model-based filtering and retrieval-augmented fine-tuning.
Stochastic Gradient Descenders at SemEval-2026 Task 9: Few-Shot LLM Prompting for Polarization Type Classification
Huynh Phu | Dang Thin
This paper presents our system for SemEval-2026 Task 9 (POLAR), Subtask 2, which focuses on classifying polarization types in social media text. We investigate three paradigms: (i) fine-tuning mDeBERTa-v3 with domain-adaptive pre-training, (ii) parameter-efficient adaptation of Qwen2.5-32B using LoRA, and (iii) few-shot prompting with Llama-3.3-70B-Instruct. Experimental results show that few-shot prompting, despite requiring no task-specific training, outperforms both fine-tuning and parameter-efficient approaches. Notably, it achieves non-zero F1 scores across all polarization categories, which is critical under macro-averaged evaluation. Our system ranks 2nd out of 29 English submissions on the official leaderboard, achieving an F1 Macro of 0.5157. These findings highlight the effectiveness of large instruction-tuned models in low-resource, label-imbalanced classification settings.
YNU-HPCC at SemEval-2026 Task 8: Parallel Generation and Multi-Metric Reranking for Faithful Extractive RAG
Bo Li | You Zhang | Jin Wang | Dan Xu | Xuejie Zhang
This paper presents our approach for the SemEval-2026 Task 8: MTRAGEval (Subtask B: Answer Generation), which challenges systems to generate faithful, extractive answers to multi-turn questions based strictly on provided gold-standard reference passages. The primary scientific challenge lies in maintaining high faithfulness and structural consistency while adapting to diverse answer styles across a conversation, as systems must generate responses that vary significantly in length and format without hallucinating. Conventional reference-based generation methods often rely on static prompting or greedy decoding, which fail to capture these dynamic stylistic requirements and lack robustness against generation noise. To address these limitations, we propose an Intent-Aware Parallel Generation and Reranking System powered by a large language model. Experimental results on the official test set demonstrate the effectiveness of our method, achieving competitive performance comparable to SoTA baselines. Ultimately, our approach secured third place in the competition. The code of the paper is available at: https://github.com/viaviachris/SemEval-2026-Task8
ICT-NLP at SemEval-2026 Task 3: Less Is More — Multilingual Encoder with Joint Training and Adaptive Ensemble for Dimensional Aspect Sentiment Regression
Liyuan Huang | Jiawei He | Wutao Shen | Lin Li | Jin Zhang
This paper describes our system for SemEval-2026 Task 3 Track A Subtask 1 on Dimensional Aspect Sentiment Regression (DimASR). We propose a lightweight and resource-efficient system built entirely on multilingual pre-trained encoders, without relying on LLMs or external corpora. We adopt joint multilingual and multi-domain training to facilitate cross-lingual transfer and alleviate data sparsity, introduce a bounded regression transformation that improves training stability while constraining predictions within the valid range, and employ an adaptive ensemble strategy via subset search to reduce prediction variance. Experimental results demonstrate that our system achieves strong and consistent performance, ranking 1st on zho-res, 2nd on zho-lap, and 3rd on jpn-hot, with all remaining datasets placed within the top half of participating teams.
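The ICT-NLP abstract does not specify the form of its bounded regression transformation. One common way to constrain a regression output to a valid range is a scaled sigmoid, sketched below as an assumption rather than the authors' method. The default bounds of 1 and 9 follow the Valence/Arousal scale described for Task 3 elsewhere in these proceedings; the function names are hypothetical.

```python
import math


def stable_sigmoid(x: float) -> float:
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bounded_prediction(raw: float, low: float = 1.0, high: float = 9.0) -> float:
    """Map an unbounded model output onto the closed interval [low, high]."""
    return low + (high - low) * stable_sigmoid(raw)


def to_raw_target(score: float, low: float = 1.0, high: float = 9.0) -> float:
    """Inverse map (logit) for scores strictly inside (low, high)."""
    p = (score - low) / (high - low)
    if not 0.0 < p < 1.0:
        raise ValueError("score must lie strictly between low and high")
    return math.log(p / (1.0 - p))
```

With this mapping, a raw output of 0 corresponds to the midpoint of the scale, and no raw value can produce a prediction outside the valid range.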
SemEval-2026 Task 13: Fine-tuned CodeBERT with Stratified Balancing, Dynamic Threshold Optimization, and Logit Bias Correction for Robust Multi-Language AI Code Detection
Udaythalavesh S | Rajalakshmi Sivanaiah | Angel Deborah S
We present a CodeBERT-based system for detecting AI-generated code in SemEval-2026 Task 13 Subtask A. To address class imbalance and model overconfidence, we apply stratified balanced subsampling, dynamic per-epoch F1-macro threshold optimization, and label-flip bias correction. The model is trained using TPU-accelerated fine-tuning and achieves a validation F1-macro of 0.874 and a private leaderboard F1-macro of 0.53. Ablation studies confirm the effectiveness of our balancing and calibration strategies under distribution shift.
SG-UniBuc-NLP at SemEval-2026 Task 6: Multi-Head RoBERTa with Chunking for Long-Context Evasion Detection
Gabriel Stefan | Sergiu Nisioi
We describe our system for SemEval-2026 Task 6 (CLARITY: Unmasking Political Question Evasions), which classifies English political interview responses by coarse-grained clarity (3-way) and fine-grained evasion strategy (9-way). Since responses frequently exceed the 512-token limit of standard Transformer encoders, we apply an overlapping sliding-window chunking strategy with element-wise Max-Pooling aggregation over chunk representations. A shared RoBERTa-large encoder supplies two task-specific heads trained jointly via a multi-task objective, with inference-time ensembling over 7-fold stratified cross-validation. Our system achieves a Macro-F1 of 0.80 on Subtask 1 and 0.51 on Subtask 2, ranking 11th in both subtasks.
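A minimal sketch of overlapping sliding-window chunking followed by element-wise max-pooling, as described in the SG-UniBuc-NLP abstract. The window and stride defaults are assumptions (the abstract gives only the 512-token limit), and the chunk vectors are plain lists standing in for encoder outputs; the function names are hypothetical.

```python
from typing import List, Sequence


def sliding_windows(token_ids: Sequence[int], window: int = 512,
                    stride: int = 256) -> List[List[int]]:
    """Split a token sequence into overlapping chunks of at most `window` tokens."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("require 0 < stride <= window")
    chunks = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # the last chunk reached the end of the sequence
        start += stride
    return chunks


def max_pool(chunk_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over per-chunk representation vectors."""
    if not chunk_vectors:
        raise ValueError("need at least one chunk vector")
    return [max(dims) for dims in zip(*chunk_vectors)]
```

Each chunk would be encoded separately, and the pooled vector then feeds the task-specific heads; a stride smaller than the window keeps context that would otherwise be cut at chunk boundaries.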
LocuPrompt at SemEval-2026 Task 7: A Multilingual Prompting Framework for Cross-Cultural Everyday Knowledge in LLMs
Ningjingke Ning
Understanding everyday cultural knowledge remains a fundamental challenge for large language models (LLMs). This paper presents LocuPrompt, a multilingual framework for SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures. To address Short Answer Questions (SAQ), we employ an English-pivot generation strategy with back-translation, combined with empirical locale-specific routing that dynamically assigns the optimal LLM to each target region. For Multiple-Choice Questions (MCQ), we apply parameter-efficient fine-tuning to a robust multilingual base model, utilizing locale-aware instructions that frame the LLM as a "local resident." By integrating strategic model selection with resource-efficient adaptation, LocuPrompt effectively bridges cross-lingual cultural gaps while maintaining a fully reproducible pipeline.
SlugRAG at SemEval-2026 Task 8: Domain-Specific Fine-Tuning and Model Scaling for Multi-Turn RAG Retrieval
Pratibha Revankar | Jihye Kim | Umit Azirakhmet
Multi-Turn Retrieval-Augmented Generation (MT-RAG) requires resolving context-dependent ambiguities across conversational turns. We present a systematic evaluation of dense retrieval optimization for the MTRAGEval benchmark (Task 8, Subtask A: Retrieval Only), investigating training-time strategies and inference-time query reformulation across four diverse English-language domains: CLAPNQ (legal/patent), FIQA (financial), GOVT (government documents), and CLOUD (cloud computing). Our experiments demonstrate that domain-specific fine-tuning yields the most substantial gains, with our best CLAPNQ model achieving Recall@10 of 0.6016 and nDCG@10 of 0.4981, representing 58.3% and 66.0% improvements over the pre-trained BGE baseline. Domain-specific models average 44.3% improvement in Recall@10 and 47.8% in nDCG@10 across all domains. Additionally, fine-tuning larger embedding models (BGE-large) achieves the best overall performance (nDCG@10: 0.5101, Recall@10: 0.6221), highlighting the complementary impact of model capacity and domain adaptation.
PEU Lab at SemEval-2026 Task 4: Pairwise Text Comparison using RoBERTa and Ranking Loss
Hangchao Ma | Jiaxu Dao | Jinli Tong | Zhuoying Li | Qingsong Zhou | Xiuzhong Tang
This paper describes the system developed by the PEU Lab for SemEval-2026 Task 4, specifically focusing on Track A: Comparative Narrative Similarity. To address the pairwise nature of the task, a lightweight contrastive ranking approach is proposed. Specifically, the pretrained RoBERTa-Large model is utilized to encode the anchor and candidate stories. Rather than employing standard cross-entropy, a margin ranking loss is introduced, which allows the relative narrative proximity between different candidate stories to be explicitly modeled. Furthermore, a 5-fold cross-validation ensemble strategy is integrated to stabilize predictions on unseen data. Evaluated on the official dataset, the optimal configuration achieved an overall accuracy of 64.50%, demonstrating the effectiveness of relative order modeling. The code for this system is available at: https://github.com/mhchhh/SemEval2026-Task-4.
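The margin ranking loss named in the PEU Lab abstract has a standard form: for a pair where one candidate should score higher, the loss is zero once the preferred candidate leads by at least the margin. The sketch below shows that standard formula in plain Python; the margin value and the scalar scores are illustrative assumptions, not values from the paper.

```python
from typing import Sequence, Tuple


def margin_ranking_loss(score_closer: float, score_farther: float,
                        margin: float = 1.0) -> float:
    """Hinge-style loss: max(0, margin - (score_closer - score_farther))."""
    return max(0.0, margin - (score_closer - score_farther))


def mean_margin_ranking_loss(pairs: Sequence[Tuple[float, float]],
                             margin: float = 1.0) -> float:
    """Average loss over (score_closer, score_farther) pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(margin_ranking_loss(c, f, margin) for c, f in pairs) / len(pairs)
```

Unlike cross-entropy over a binary label, this objective only constrains the relative order of the two candidate scores, which is the property the abstract emphasizes.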
YoungDSMLKZ at SemEval-2026 Task 13: MIL-UniXcoder with Meta-Stacking and Handcrafted Features for AI-Generated Code Detection
Yeraly Gainulla | Agzam Shamsadinov
We propose and validate a multi-view ensemble framework for 4-class AI-generated code detection (Human, AI, Hybrid, Adversarial) in realistic long-form repositories. Our system, Team YoungDSMLKZ, ranked 1st out of 50+ teams in SemEval-2026 Task 13 Subtask C with a macro F1 of 0.7855 (+5.2 over runner-up). The framework combines: (i) a Dynamic Multiple Instance Learning (MIL) pipeline over UniXcoder chunks for O(N)-scalable long-context detection, (ii) transformer-based meta-stacking (UniXcoder and ModernBERT), and (iii) an XGBoost classifier on 200+ handcrafted stylometric features. Evidence localization analysis shows that 62.4% of decisive AI-detection signals reside beyond the standard 512-token window, validating the MIL design.
VARH-AI at SemEval-2026 Task 10: Exploiting Architectural Diversity with Transformer-SSM Ensembles and Confidence-Based Iterative Refinement for Conspiracy Detection
Hritav Solanki | Shubham Sharma | Manish Prasad | Rakhi Agrawal | Yashvardhan Sharma
This paper describes our system for SemEval 2026 Task 10 (PsyCoMark), focusing on Subtask 2: binary conspiracy classification in Reddit submission statements. We present a heterogeneous ensemble approach that combines Transformer-based models (DeBERTa, RoBERTa) with State-Space Models (Mamba) to leverage architectural diversity for improved generalization. Our key contributions include: (1) Bidirectional Mamba (BiMamba), adapting state-space sequence models for bidirectional document classification; (2) a safety-switched multi-task training setup that uses marker supervision only for gold-annotated samples, preventing noisy pseudo-labeled rows from affecting the span extraction objective; and (3) Confidence-Based Iterative Refinement, using committee voting for high-quality pseudo-label generation. Our best official submission achieved a weighted F1 score of 0.78 on the Subtask 2 test set, ranking 4th on the public CodaBench leaderboard. We provide detailed ablation studies demonstrating the complementary contributions of each architectural component to inform future research directions.
HABIBTAZ at SemEval-2026 Task 11: Disentangling Formal Logic from Content via Synthetic Training and Multi-Objective Optimization
Abdullah Shaikh | Zain Naqi | Taha Zahid | Sandesh Kumar | Abdul Samad
While Large Language Models (LLMs) excel in many general NLP tasks, their formal reasoning capabilities are often compromised by content effects, demonstrating a measurable bias towards real-world plausibility. In this paper, we present our system for SemEval-2026 Task 11, which evaluates the ability of models to disentangle formal logic from content across 12 languages with and without distractor premises. We address this challenge using mDeBERTa-v3 networks fine-tuned on a synthetic, rule-based dataset of syllogistic schemes to avoid the semantic noise of LLM-augmented data. To explicitly decouple plausibility from logical structure, our training pipeline employs a multi-objective loss function combining Adaptive Group Distributionally Robust Optimization (DRO), a scheduled differentiable bias penalty, and KL-Divergence consistency regularization. Our system achieved #1 ranks and perfect Ranking Scores (100.0) with 0.00% bias and 100.0% accuracy on Subtask 1 (English), Subtask 2 (Noisy English), and Subtask 3 (Multilingual). On the highly complex Subtask 4 (Noisy Multilingual), the system achieved the 6th rank with 89.06% Accuracy and F1-score, alongside a limited 2.89% Bias and a 37.78 Ranking Score. Our dataset generation engine and codebase are publicly available to facilitate future work on robust logical reasoning.
TransformerTrio at SemEval-2026 Task 13: Navigating Domain Shift and Representation Instability in Machine-Generated Code Detection
Avi Patel | Manthan Laddha | Pushti Sapovadiya | Pruthwik Mishra | Shrikant Malviya
Detecting machine-generated code is increasingly challenging due to advances in code generation models and domain variation across programming tasks. We present our submissions to SemEval-2026 Task 13, evaluating detection in three settings: binary human vs. machine classification, multi-class generator attribution, and four-way authorship classification including hybrid and adversarial cases. We compare feature-based, transformer-based, and hybrid approaches under domain shift and limited supervision. Results show that domain-specific signals often dominate model decisions, degrading generalization when training and test distributions diverge. Increasing model complexity does not consistently improve performance in low-resource or cross-domain settings and may amplify spurious correlations. These findings emphasize robustness and feature alignment over model sophistication for reliable detection.
SSN-CSE-CODECATALYSTS at SemEval-2026 Task 13: Integrating Transformer Semantics and AST-Derived Structural Features for AI-Generated Code Detection
Bhuvana J | Ramanan Mahendran | Siddharth Chandrasekar S | Pragatheesh J | Rethanya P
Pre-trained transformers often struggle with multilingual code classification due to sequence length constraints and difficulties in explicitly capturing deep structural complexities. To address this for SemEval-2026 Task 13, a hybrid neural architecture that fuses CodeBERT’s semantic embeddings with handcrafted software engineering metrics is proposed. A Head+Tail truncation strategy preserves crucial logic in long sequences, while explicit Abstract Syntax Tree (AST) features, including maximum depth, branching factor, and cyclomatic complexity, are extracted via tree-sitter. By integrating dense language model representations with explicit structural heuristics, this work provides a robust and scalable solution for enhanced code classification.
king001 at SemEval-2026 Task 7: Cross-Language Cultural Everyday Knowledge QA System Based on RAG
Meizhi Jin | Zhichao Meng | Junqi Yin | Lianxin Jiang | Jianyu Li
This paper describes our system used in the SemEval-2026 Task 7: Cross-Language Cultural Everyday Knowledge QA (track 1). Cultural knowledge typically exhibits significant regional specificity and is deeply rooted in particular linguistic conventions, posing severe challenges to general-purpose large language models (LLMs). We propose a retrieval-augmented generation (RAG) framework: this framework utilizes text-embedding-v4 as the retrieval core to precisely extract social knowledge and expression patterns from region-specific large-scale multilingual cultural knowledge bases, and drives the gpt-5.2-chat model to generate concise answers that are both logically factual and highly aligned with the target region’s cultural context. In the official evaluation, our system ranked first among all participating teams with a total score of 78.7672, fully demonstrating the method’s outstanding performance in cross-cultural accuracy and linguistic authenticity.
SteerForce at SemEval-2026 Task 11: Reducing Content Effects Using Layered Activation Steering
Noah Tratzsch | Asmaa Al-Raian | Mounika Marreddy | Alexander Mehler
Large language models exhibit content effects, where surface plausibility interferes with formal logical reasoning. In SemEval-2026 Task 11, this appears as a performance gap between plausibility-aligned and plausibility-conflicting syllogisms, reflecting directional content bias. We address this issue using inference-time activation steering, modeling bias as a geometric deviation between plausibility-driven and validity-driven representations. We introduce a layered steering framework that combines Activation Transport (ACT) with input-adaptive contrastive steering (K-CAST), applied to layers identified through sensitivity analysis. This architecture-aware strategy enables targeted interventions without retraining. On BERT, sequential multi-layer steering improves validity accuracy from 77.1% to 82.3% while reducing bias by 75%. In contrast, for the decoder-only Qwen2.5-1.5B-Instruct, a single mid-to-late layer intervention reduces bias from 0.26 to 0.04 with modest accuracy gains, whereas multi-layer steering offers no additional benefit. These results reveal a fundamental architectural distinction: encoder-based models benefit from distributed low-intensity corrections, while decoder-only instruction-tuned models concentrate reasoning signals within a narrow late-layer band. Our findings demonstrate that effective bias mitigation requires architecture-aware activation steering.
Sabancigroup4 at SemEval-2026 Task 5: Uncertainty-Aware Semantic Plausibility Scoring via GNLL Regression and LLM Rationales
Salih Büyükbaş | Doruk Benli | Osman Kara | Dilara Keküllüoğlu
SemEval-2026 Task 5 is a shared task on rating the plausibility of an ambiguous homonym in a predetermined context. The dataset consists of precontext, sentence, and ending combinations for each homonym, and the plausibility of each sample is manually rated by 5 annotators. Participating teams were asked to automatically predict the plausibility with respect to the mean rating given by the annotators. Unlike traditional models that rely on single-label selection, this task frames disambiguation as a probabilistic distribution over multiple plausible meanings. To this end, we propose an uncertainty-aware training strategy using GNLL regression, and semantic context enrichment through POS tags and LLM rationales. Our system exhibits competitive performance, achieving 90% accuracy within standard deviation and 81% Spearman correlation, and placing ninth on the leaderboard.
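As a rough illustration of the uncertainty-aware objective mentioned above, the sketch below assumes that GNLL stands for Gaussian negative log-likelihood, with the model predicting both a mean plausibility score and a variance. The function name and the example values are hypothetical and are not taken from the paper.

```python
import math

def gaussian_nll(mean, var, target, eps=1e-6):
    """Per-sample Gaussian negative log-likelihood (constant term omitted)."""
    var = max(var, eps)  # clamp the variance for numerical stability
    return 0.5 * (math.log(var) + (target - mean) ** 2 / var)

# The same error of 2 rating points, with a low and a high predicted variance.
confident_miss = gaussian_nll(mean=3.0, var=0.25, target=5.0)
hedged_miss = gaussian_nll(mean=3.0, var=1.0, target=5.0)
```

Under this loss, an overconfident miss costs more than the same miss made with a larger predicted variance, which is what lets the variance act as an uncertainty estimate.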
IITKanBDone at SemEval-2026 Task 8: MTRAGEval - Evaluating Multi-Turn RAG Conversations
Soumendra Ray | Garima Gupta
This paper describes our system for the MT-RAG (Multi-Turn Retrieval-Augmented Generation) shared task, which addresses the challenge of multi-turn conversational question answering using retrieval-augmented generation. We participated in three sub-tasks of Task 8: Task A (retrieval), Task B (generation with reference passages), and Task C (end-to-end RAG). For Task A, we evaluated multiple retrieval approaches including BM25, BGE, and hybrid methods, achieving best performance with ELSER (Elastic Learned Sparse EncodeR) with nDCG@5 of 0.4018 (Rank 24/38). For Task B, we employed the Mistral-7B-Instruct-v0.2 model via HuggingFace for response generation using gold reference passages, achieving a harmonic mean score of 0.6976 (Rank 13/26). For Task C, we combined ELSER retrieval with Mistral-7B generation, using top-5 retrieved passages as context, achieving a score of 0.4289 (Rank 23/29). Our system demonstrates the effectiveness of learned sparse retrieval methods and instruction-tuned models for multi-turn conversational RAG scenarios.
asetclarity at SemEval-2026 Task 6: An Imbalance-Aware RoBERTa Cross-Encoder for Political Response Clarity Classification
Maria-Antonia-Emanuela Pascu | Dan Dodun-des-Perrieres | Daniela Gifu
We address response-clarity classification in political interviews as defined in SemEval-2026 Task 6: CLARITY - Unmasking Political Question Evasions, Task 1, where systems must label each question–answer pair as Clear Reply, Ambivalent, or Clear Non-Reply. We present a reproducible end-to-end pipeline built around a single-stream RoBERTa-large cross-encoder fine-tuned for three-way classification using deterministic text normalization, concatenated QA inputs, and imbalance-aware training (minority oversampling and class-weighted loss). To improve robustness, we train a 5-fold stratified ensemble and combine models via soft-voting. Our official shared-task submission obtains 0.76 macro-F1 on the official leaderboard, ranking 16 out of 41 participating systems. Finally, we deploy the classifier in a lightweight web application supporting both direct text input and audio-based analysis through automatic transcription, enabling interactive inspection of predicted clarity categories.
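The soft-voting step described above can be sketched as follows. This is a minimal illustration, assuming each fold model outputs a probability vector over the three labels; the helper name and the probability values are made up for the example.

```python
def soft_vote(prob_rows):
    """Average per-model class probabilities; return (argmax index, mean probs)."""
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    mean = [sum(row[c] for row in prob_rows) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__), mean

LABELS = ["Clear Reply", "Ambivalent", "Clear Non-Reply"]

# Hypothetical outputs of five fold models for one question-answer pair.
fold_probs = [
    [0.50, 0.40, 0.10],
    [0.30, 0.60, 0.10],
    [0.45, 0.45, 0.10],
    [0.20, 0.70, 0.10],
    [0.55, 0.35, 0.10],
]
idx, mean_probs = soft_vote(fold_probs)
predicted = LABELS[idx]
```

Averaging probabilities rather than counting hard votes lets confident models outweigh uncertain ones.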
FactUEP at SemEval-2026 Task 4: Structured Narrative Similarity Scoring with Aspect Decomposition and Weak-Signal Gating
Marcin Sawinski
This paper presents an approach to narrative similarity prediction for SemEval-2026 Task 4 Track A. We introduce an LLM-based system that operationalizes the three core dimensions (Abstract Theme, Course of Action, and Outcomes) via schema-constrained prompting to enforce structured outputs and alignment with the annotation protocol. The system proceeds in three stages: structured aspect decomposition and scoring, weak-signal gating for low-confidence cases, and a targeted LLM-based tiebreak. The final model achieved near-human performance and ranked second on the Track A leaderboard.
Narrative Team at SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Sentences through Narrative Understanding
Valentin Istrate | Mocanu Octavian | Tatiana Khaidukova
This paper describes our system for SemEval-2026 Task 5, which focuses on predicting the plausibility of word senses in ambiguous narrative contexts. The task requires assigning a real-valued plausibility score to candidate word senses based on aggregated human judgments. Our approach compares two modeling paradigms: (i) a pretrained transformer-based regression model using DistilBERT fine-tuned on the task data, and (ii) a lightweight neural baseline based on a bidirectional LSTM trained either from scratch or initialized with GloVe embeddings. Input representations combine a candidate sense definition with the narrative context and target sentence, separated by a special token. On the official test set, the DistilBERT model achieves the strongest result among our submissions, with an Acc@SD score of 0.54 and Spearman correlation of 0.17, while the best BiLSTM submission reaches 0.52 Acc@SD and 0.02 Spearman correlation. Although DistilBERT performs best in our experiments, the recurrent baseline remains competitive under the tolerance-based metric. We discuss model variants, reproducibility details, and limitations of our analysis.
CSECU-DSG at SemEval-2026 Task 6: Imbalance-Aware Transformers for Unmasking Political Question Evasions
Subha Shesgin | Sumaiya Nazneen | Abu Nowshed Chy
Clarity-Level Classification predicts the degree of clarity of a response to a query. It is essential to the advancement of many NLP applications, such as conversational AI, customer support automation, and instructional technology. However, it is challenging to assess answer clarity due to unclear wording, incomplete answers, and the contextual dependence between questions and answers. This paper describes our participation in the shared task on clarity classification, SemEval-2026 Task 6, which was created to address these issues. We propose a transformer-based method using question-answer pair regression and classification. Our model is a fine-tuned transformer based on DeBERTa-v3-base. To address class imbalance, we use class-weighted loss functions and oversampling. Experimental results show that the proposed approach achieves competitive performance.
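A minimal sketch of class-weighted loss, assuming inverse-frequency ("balanced") weights; the abstract does not say how the weights were chosen, so the weighting scheme, function names, and toy label counts below are illustrative assumptions.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

def weighted_cross_entropy(probs, label, weights):
    """Cross-entropy for one sample, scaled by the weight of its gold class."""
    return -weights[label] * math.log(probs[label])

# Toy imbalanced training labels (hypothetical counts).
train_labels = ["clear"] * 8 + ["evasive"] * 2
weights = balanced_class_weights(train_labels)

# The same predicted probability costs more when the gold class is rare.
loss_majority = weighted_cross_entropy({"clear": 0.5, "evasive": 0.5}, "clear", weights)
loss_minority = weighted_cross_entropy({"clear": 0.5, "evasive": 0.5}, "evasive", weights)
```

With these weights, errors on the rare class contribute four times as much to the loss as errors on the frequent class.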
YNU-ABSA at SemEval-2026 Task 3: A Unified Pipeline for Continuous and Structured Dimensional ABSA
Qimao He | Xiaobing Zhou
Dimensional Aspect-Based Sentiment Analysis (DimABSA) aims to jointly model continuous Valence–Arousal (VA) regression and structured sentiment extraction at the aspect level in multilingual settings, requiring both fine-grained emotion modeling and structural consistency. Prior approaches often separate regression and extraction or rely on stagewise pipelines, which may limit numerical stability and structural alignment. To address this challenge, we propose a unified pipeline for all three subtasks of DimABSA Track A. Although Task 1 and Task 2/3 use different backbone architectures, they are integrated through consistent preprocessing, a shared dimensional sentiment perspective, and unified post-processing principles. For Task 1, we enhance aspect–context interaction via aspect-conditioned cross-attention and attention pooling, together with bounded output mapping and lightweight calibration for stable VA prediction. For Task 2/3, we formulate triplet and quadruplet prediction as constrained conditional generation with LoRA fine-tuning and structural validation. Experiments show consistent improvements across languages, including lower RMSE, higher correlation, and better cF1. Error analysis further shows that Arousal remains more difficult than Valence.
CuriosAI at SemEval-2026 Task 8: Hybrid retrieval system with repeated sampling for generation
Aiswariya Manoj Kumar | Hiroki Takushima | Fumika Beppu | Yuki Shibata | Daichi Yamaga | Takayuki Hori
SemEval-2026 Task 8 (MTRAGEval) evaluates multi-turn Retrieval-Augmented Generation (RAG) under conversational challenges such as non-standalone turns, underspecification, and answerability detection. These conditions amplify retrieval and generation errors that standard single-turn RAG pipelines fail to address effectively. We present a robustness-oriented multi-turn RAG system combining contextual query rewriting, heterogeneous hybrid retrieval fused with Reciprocal Rank Fusion (RRF), domain-adaptive Low-Rank Adaptation (LoRA) reranking, and repeated sampling with metric-guided selection. On the official test set, our approach outperforms the organizers’ baselines across all subtasks: Retrieval (nDCG@5: 0.5396 vs. 0.4795), Generation (0.7571 vs. 0.6390), and RAG (0.5486 vs. 0.5366). Our system ranks 5th in Subtask A, 5th in Subtask B, and 7th in Subtask C on the official leaderboard. These results demonstrate that calibrated hybrid retrieval combined with robust generation selection is effective for multi-turn RAG.
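Reciprocal Rank Fusion, named above as the fusion step, can be sketched in a few lines. The smoothing constant k=60 is a commonly used default and is an assumption here, as are the retriever outputs; the abstract does not give the system's actual settings.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

sparse_hits = ["d1", "d2", "d3"]  # hypothetical lexical retriever output
dense_hits = ["d3", "d1", "d4"]   # hypothetical dense retriever output
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
```

Because only ranks are used, the retrievers' raw scores do not need to be on a comparable scale.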
deepgpt at SemEval-2026 Task 1: A Chinese Humor Generation System via Instruction-Masked QLoRA and Reverse Constraint Data Mixing
城 陈
This paper presents the system description of the deepgpt team for SemEval-2026 Task 1 (MWAHAHA: Computational Humor Generation), Subtask A. To address the challenge of generating high-quality Chinese humor under strict text constraints (e.g., incorporating specified rare words or relating to news headlines), we propose a parameter-efficient generation system based on Qwen2.5-3B-Instruct. We reconstructed 8,000 multi-source Chinese jokes into a conversational instruction tuning format. Crucially, to mitigate the prevalent issues of formatting hallucinations and template collapse, we introduce a strict Instruction Masking strategy during 4-bit QLoRA fine-tuning. By completely isolating the loss calculation to the target humorous text, the model is forced to treat constraints as conditional inputs rather than conversational distributions to mimic. Empirical results show that this architectural intervention completely eradicates meaningless conversational fillers. Our system significantly boosted the hard constraint adherence (CAcc) to 94.6% and achieved a highly competitive Elo rating of 903 in the official Pairwise Human Evaluation, validating the effectiveness of specific masking fine-tuning for lightweight large language models in strictly constrained generation tasks.
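One common way to restrict the loss to the target text is to set the labels of all prompt tokens to an ignore value. The sketch below assumes the -100 convention used by PyTorch's cross-entropy loss and Hugging Face trainers; the token IDs and helper name are hypothetical, and the paper's actual implementation may differ.

```python
IGNORE_INDEX = -100  # label value skipped by the loss under the assumed convention

def build_masked_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; supervise only the response tokens."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

prompt_ids = [101, 7, 8, 9]   # hypothetical IDs for the instruction and constraints
response_ids = [42, 43, 102]  # hypothetical IDs for the target joke
input_ids, labels = build_masked_labels(prompt_ids, response_ids)
```

The model still attends to the instruction tokens as context, but no gradient rewards it for reproducing them.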
CSECU-DSG at SemEval-2026 Task 10: Fine-Tuning DeBERTa Transformer Model for Conspiracy Detection
Debashish Chakraborty | Sumaiya Tabassum | Sabrina Ibnath | Abu Nowshed Chy
Conspiracy detection aims to determine whether a social media post expresses belief in conspiracy theories. This task is essential for understanding harmful online discourse and mitigating the spread of misinformation. However, detecting conspiracy beliefs is challenging due to subtle psycholinguistic cues and the strong contextual dependency of such claims. To address these challenges, SemEval-2026 Task 10 introduced a shared task named PsyCoMark. In this paper, we describe our approach to Subtask 2, which focuses on detecting conspiracy beliefs. We propose a transformer-based classification approach using a fine-tuned DeBERTa-v3-base model to detect conspiracy beliefs in Reddit comments. Each post is processed as a single input sequence. To address class imbalance and improve generalization, we employ class-weighted cross-entropy loss with label smoothing during training. Our approach achieves competitive performance, ranked ninth among participating teams. The findings demonstrate that fine-tuned transformer models effectively capture contextual and psycholinguistic patterns in conspiracy-related discourse and achieve competitive performance compared to other systems.
CUET-823 at SemEval-2026 Task 9: LoRA-Based Instruction Fine-Tuning of LLMs vs. Transformer Models for Bengali Polarization Detection
Arpita Mallik | Ratnajit Dhar
The rapid growth of social media has gone hand in hand with a sharp increase in heated public discussions, where debates about elections, conflicts, protests, and identity often turn into divisive and polarized rhetoric. In this paper, we present our system for SemEval 2026 Task 9 – Subtask 1: Multilingual Text Classification Challenge-Polarization Detection, focusing specifically on the Bengali language. The task is a binary classification problem aimed at determining whether a social media post exhibits attitude polarization, such as intolerance, dehumanization, deindividuation, vilification, or stereotyping toward others’ opinions, identities, or beliefs. Among 49 participating teams, our approach ranked 2nd, achieving a macro-F1 score of 0.8582. We experimented with both transformer-based models and large language models (LLMs), and observed that LoRA-based instruction fine-tuned LLM-based approaches delivered the strongest performance in detecting nuanced and context-dependent polarization in Bengali text.
H-RAG at SemEval-2026 Task 8: Hierarchical Parent–Child Retrieval for Multi-Turn RAG Conversations
Passant Elchafei | Hossam Emam | Mohamed Alansary | Monorama Swain | Markus Schedl
We present H-RAG, our submission to SemEval-2026 Task 8 (MTRAGEval), addressing both Task A (Retrieval) and Task C (Generation with Retrieved Passages). Task A evaluates standalone retrieval quality, while Task C assesses end-to-end retrieval-augmented generation (RAG) in multi-turn conversational settings, requiring both accurate answer generation and faithful grounding in retrieved evidence. Our approach implements a hierarchical parent–child RAG pipeline that separates fine-grained child-level retrieval from parent-level context reconstruction during generation. Documents are segmented into overlapping sentence-based child chunks, while full documents are preserved as parent units to provide coherent context. weighting, and embedding-based similarity rescoring over child chunks. Retrieved evidence is aggregated at the parent level and supplied to an instruction-tuned language model for response generation. H-RAG achieves an nDCG@5 score of 0.4271 on Task A and a harmonic mean score of 0.3241 on Task C (RBagg: 0.2488, RLF: 0.2703, RBllm: 0.6508), underscoring the importance of retrieval configuration and parent-level aggregation in multi-turn RAG performance.
SLPGFJWUWarda at SemEval-2026 Task 1: A Multimodal Vision-Language Approach for Humor Generation Using Fine-Tuned BLIP
Warda Yousaf
We present a BLIP-based multimodal system for image-based humor generation submitted to SemEval-2026 Task 1 (MWAHAHA), focusing on Task B1. Our approach fine-tunes a vision–language model on meme-style captions and handles animated GIFs via representative frame extraction to generate culturally grounded humorous captions.
hllwan at SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis via LLM Feature Fusion and Test-Time Adaptation
Jinglong Li | Yang Yang
This paper describes the system developed by the team for SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Unlike traditional categorical sentiment analysis, predicting continuous Valence and Arousal (VA) scores across multiple languages and domains poses significant theoretical and engineering challenges. To systematically address data scarcity and cross-domain distribution shifts, we propose a highly robust framework. First, we implement a translation-based data augmentation strategy with precise HTML-tag alignment to mitigate low-resource constraints. Second, we introduce an unsupervised opinion extraction module based on syntactic dependency parsing to explicitly capture sentiment-bearing words. Third, we design a Tripartite Feature Fusion architecture built upon both encoder-only (DeBERTa-v3) and causal LLM (Qwen2.5) models to dynamically aggregate global and localized aspect-opinion embeddings. Finally, we apply an unsupervised Test-Time Adaptation (TTA) mechanism to calibrate normalization layers on the fly. Our system demonstrates highly competitive performance while offering critical insights into the limitations of LLMs in cross-lingual sentiment transfer.
CITD@UIT at SemEval-2026 Task 4: Structured Reasoning and Metric Specialization for Narrative Similarity
Thach Nguyen | Duc-Vu Nguyen | Dang Thin
We present a synergistic dual-track approach for SemEval-2026 Task 4 on narrative similarity, covering Track A (triple-wise classification) and Track B (narrative representation) through failure-driven data enrichment. The shared task received 71 final submissions from 46 teams across its two tracks. For Track A, we explore three reasoning strategies: hybrid Cross-Encoder–LLM arbitration (66.5% dev), DSPy-based component-wise decomposition (68.0% dev), and a multi-stage pairwise reasoning pipeline with enforced moral agency hierarchies, where the final Gemini 2.5 Pro/Flash system achieves 77.39% on development and 69.25% on test data, ranking 17th among 46 participating teams in the official evaluation. For Track B, we propose BGE-M3 (LoRA), an instruction-guided dense representation model trained with Multiple Negatives Ranking Loss (MNRL); since Track B provides only unlabeled story instances, we specialize the embedding space using adversarial samples synthesized from Track A failure cases, achieving 68.75% in the official evaluation and ranking 6th among 26 participating teams. Our analysis shows that narrative similarity depends more on outcome alignment and moral trajectory than lexical overlap, highlighting the complementary roles of explicit reasoning and task-specific metric-space specialization.
YNU-HPCC at SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Stories through Narrative Understanding
Mingyu Bai | Jin Wang | Xuejie Zhang
This paper introduces our approach to SemEval 2026 Task 5, which rates the plausibility of word senses in ambiguous stories through narrative comprehension. This task requires models to assess the consistency between a given word-sense definition and the meaning of an ambiguous target word in a short narrative context, and to infer a plausibility score on a 1-5 scale. We experimented with and compared multiple methods. These methods include multi-head ensembles that simulate the behavior of individual annotators, ordinal classification and regression methods that treat scores as ordered categories, and direct regression using mean squared error (MSE) or L1 loss to predict human-average consensus scores. Additionally, we investigated instruction fine-tuning with low-rank adaptation (LoRA) on large language models (LLMs) such as Qwen3-4B-Instruct and Phi-4-mini. Our experimental results show that the direct MSE regression method performs best. This study indicates that directly optimizing to approach human consensus scores is effective for this task, while methods that model individual annotator differences are less applicable.
pfr821 at SemEval-2026 Task 9: Multilingual Polarization Detection via Hybrid XLM-RoBERTa with Targeted Data Augmentation and Imbalance-Aware Training
Antoine Durand | Rémi Hamon | Matthieu Pereira | Nathan Boucneau | Paul Cintra
This paper describes HYPOLDET, the system submitted by team pfr821 to SemEval-2026 Task 9 (Polarization Detection, Subtask 1), a binary classification task over 22 typologically diverse languages. Our approach combines three complementary contributions. We first extend XLM-RoBERTa-Large with a custom transformer encoder layer and a learned attention-based pooling mechanism (Hybrid Architecture), allowing the model to aggregate token-level signals beyond the [CLS] representation. We then augment training data through a targeted LLM-based synthetic generation pipeline (Grok API), producing culturally grounded examples for low-resource and imbalanced languages. Finally, we address class imbalance at the training level through an imbalance-aware regime combining a per-language balanced batch sampler, weighted focal loss, and label smoothing. Our best single model achieves an unweighted macro-averaged F1 of 0.796, and a lightweight ensemble reaches 0.798, ranking in the top 10 for 7 languages and 2nd place for Hausa.
AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection
Panagiotis Spanakis | Maria Lymperaiou | Giorgos Filandrianos | Athanasios Voulodimos | Giorgos Stamou
This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. Unlike traditional classifiers that conflate semantic reasoning with structural localization, our decoupled design isolates and addresses these challenges separately. For marker extraction, we propose Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring to resolve semantic ambiguity and character-level brittleness. For conspiracy detection, an “Anti-Echo Chamber” architecture, consisting of an adversarial Parallel Council adjudicated by a Calibrated Judge, overcomes the “Reporter Trap”, where models falsely penalize objective reporting. Our system achieves 0.24 Macro F1 (+100% over baseline) on S1 and 0.79 Macro F1 (+49%) on S2, ranking 3rd on the S1 development leaderboard and 8th on the test set, demonstrating that structured agentic deliberation is an effective alternative to fine-tuning for interpretable psycholinguistic NLP.
One and Only at SemEval-2026 Task 2: Evaluating Zero-Shot Autonomous LLM Agents and Heuristic Proxies in Ecological Affect Forecasting
Nam Dinh
This paper presents team One and Only’s system for SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time (Soni et al., 2026). We investigate whether zero-shot LLM reasoning can replace fine-tuning for ecological affect forecasting by combining deterministic statistical priors with frozen LLMs (Gemini 3 Pro, Claude Opus 4.6, GPT-5.2). For short-term state changes (Subtask 2A), an OLS mean-reversion anchor is paired with LLM-generated impulses; for long-term disposition changes (Subtask 2B), a Chain-of-Thought prompt drives direct numeric prediction. Our system underperforms fine-tuned approaches on both subtasks. However, post-submission ablation across three LLMs reveals a task-dependent pattern: CoT reasoning substantially improves disposition forecasting (r_V: −0.185 → +0.129; MAE_V: 0.899 → 0.422), while uncalibrated LLM impulses degrade state-change prediction due to variance collapse (σ_pred = 0.41 vs. σ_gold = 1.73). We provide a detailed diagnostic analysis of these failure modes and release all prompts and outputs for reproducibility.
AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis
Stavros Gazetas | Giorgos Filandrianos | Maria Lymperaiou | Paraskevi Tzouveli | Athanasios Voulodimos | Giorgos Stamou
In this paper, we present the AILS-NTUA system for Track A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which encompasses three complementary problems: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quadruplet Prediction (DimASQP) within a multilingual and multi-domain framework. Our methodology combines fine-tuning of language-appropriate encoder backbones for continuous aspect-level sentiment prediction with language-specific instruction tuning of large language models using LoRA for structured triplet and quadruplet extraction. This unified yet task-adaptive design emphasizes parameter-efficient specialization across languages and domains, enabling reduced training and inference requirements while maintaining strong effectiveness. Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.
TeleAI at SemEval-2026 Task 4: Few-Shot Narrative Similarity Modeling for Classification and Ranking
Weiwei Fu | Shiquan Wang | Ruiyu Fang | Shuangyong Song
This paper presents a unified, task-adaptive modeling framework for the two tracks of SemEval-2026 Task 4: Narrative Similarity. For Track A, we build a three-stage pipeline of three-dimensional narrative-anchored chain-of-thought (CoT) reasoning, multi-view data augmentation, and Low-Rank Adaptation (LoRA) fine-tuning. For Track B, we design an architecture fully aligned with the ranking inference pipeline and task objective, along with corresponding data augmentation and expansion methods, and propose Smooth Cosine Contrastive Loss (SCCL) to stabilize training in low-resource settings. Systematic experiments verify the effectiveness of each core module, and our final systems rank 4th in both tracks, providing a reproducible technical solution for few-shot similarity modeling.
LogSigma at SemEval-2026 Task 3: Uncertainty-Weighted Multitask Learning for Dimensional Aspect-Based Sentiment Analysis
Baraa Hikal | Jonas Becker | Bela Gipp
This paper describes LogSigma, our system for SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Unlike traditional Aspect-Based Sentiment Analysis (ABSA), which predicts discrete sentiment labels, DimABSA requires predicting continuous Valence and Arousal (VA) scores on a 1–9 scale. A central challenge is that Valence and Arousal differ in prediction difficulty across languages and domains. We address this using learned homoscedastic uncertainty, where the model learns task-specific log-variance parameters (log σ²) to automatically balance each regression objective during training. Combined with language-specific encoders and multi-seed ensembling, LogSigma achieves 1st place on five datasets across both tracks. The learned variance weights vary substantially across languages due to differing Valence–Arousal difficulty profiles—from 0.66× for German to 2.18× for English—demonstrating that optimal task balancing is language-dependent and cannot be determined a priori.
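The balancing idea can be illustrated with a small sketch. It assumes the widely used homoscedastic-uncertainty form, in which each task loss is scaled by exp(-s) and regularized by s, with s = log σ² learned per task; the exact objective used by LogSigma may differ, and the loss values below are invented.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses using learned log-variances s_i = log(sigma_i^2).

    total = sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i
    """
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total

valence_loss, arousal_loss = 1.0, 2.0  # hypothetical per-task regression losses

# Equal weighting (s = 0 for both tasks) versus down-weighting the harder task.
equal = uncertainty_weighted_loss([valence_loss, arousal_loss], [0.0, 0.0])
adapted = uncertainty_weighted_loss([valence_loss, arousal_loss], [0.0, math.log(2.0)])
```

For a fixed task loss L, the term is minimized at s = log L, so a harder task is automatically assigned a larger variance and a smaller weight.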
NYCU-NLP at SemEval-2026 Task 9: Stacking Small Language Models for Multilingual, Multicultural and Multievent Polarization Detection
Ding-Xiang Lin | Po-Chun Chu | Lung-Hao Lee
This paper presents the NYCU-NLP system for SemEval-2026 Task 9 on online polarization analysis. Our approach explores the effectiveness of instruction-tuned small language models (SLMs), including Phi-4 (14B), Mistral-small-3.2 (24B), and Gemma-3 (27B), with task-specific prompting strategies and combines them via a stacking ensemble to leverage complementary modeling capacities. Evaluated across 22 languages and three subtasks, our system achieved macro-averaged F1 scores of 0.8071 for Polarization Detection (Subtask 1), 0.6108 for Polarization Type Classification (Subtask 2), and 0.5111 for Polarization Manifestation Identification (Subtask 3). Notably, our approach ranked first in 15, second in 12, and third in 10 of the 62 language-specific leaderboards, demonstrating the robustness and competitiveness of stacking-based SLM ensembles in multilingual settings.
d’Olle Grieze at SemEval-2026 Task 11: Comparing the Impact of Supervised Fine-Tuning and Activation Steering on Mitigating Content Effect Bias in Syllogistic Reasoning
Twan Huiskens | Tian Niezing | Koen Snelten
We investigate the content effect bias in Large Language Models (LLMs) as part of SemEval 2026 Task 11. We compare the impact of supervised fine-tuning (SFT) using low-rank adaptation against activation steering across several model families, including LLaMA, Gemma and Qwen. Our results show that SFT improves accuracy, with LLaMA 8B reaching 98.75% accuracy. Activation steering offers limited effectiveness in mitigating the content effect bias. A logit lens analysis further reveals that fine-tuning successfully shifts the model’s focus toward logical structure, specifically within the later layers.
Cryptix at SemEval-2026 Task 4: Zero-Shot Bi-Encoder Modeling for Narrative Story Similarity - A Sentence Transformer Approach
Sushmitha M | Sarath Kumar P | Thanalaxmi S | Beulah A
This submission presents a zero-shot embedding-based approach for SemEval-2026 Task 4 on Narrative Story Similarity. The system employs the pretrained sentence-transformers/all-mpnet-base-v2 model within a bi-encoder architecture to generate 768-dimensional story embeddings. Narrative similarity is modeled using cosine similarity in embedding space for comparative prediction in Track A and representation generation in Track B. The approach does not involve task-specific fine-tuning and treats narrative comparison as a geometric proximity problem. Experimental results and error analysis highlight the strengths of pretrained semantic encoders in capturing thematic similarity, while revealing limitations in modeling deeper narrative structure and causal progression.
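As a rough illustration of the "geometric proximity" view described above, the sketch below picks whichever of two candidate stories lies closer to an anchor story by cosine similarity. The 3-dimensional toy vectors stand in for the 768-dimensional embeddings, and the function names and the A/B output format are assumptions made for illustration, not the submission's actual interface.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def closer_story(anchor, story_a, story_b):
    """Return "A" if story_a is at least as similar to the anchor as story_b, else "B"."""
    sim_a = cosine_similarity(anchor, story_a)
    sim_b = cosine_similarity(anchor, story_b)
    return "A" if sim_a >= sim_b else "B"


# Toy embeddings; a real system would obtain these from the sentence encoder.
choice = closer_story([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0])
```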
Königsberg at SemEval-2026 Task 13: Beyond Language Models: A Low-Resource Feature-Driven and Data-Flow Embedding Approach for Machine-Generated Code Detection
Shahir Habib
The rise of Large Language Models (LLMs) has increased the need for reliable detection of machine-generated code. This paper presents a low-resource, hybrid detection framework developed for SemEval-2026 Task 13, designed to operate efficiently without the computational overhead of end-to-end fine-tuning of large models. Our approach combines (i) a comprehensive feature extraction pipeline that calculates interpretable software metrics capturing stylistic and structural properties of code, and (ii) frozen embeddings extracted from the pre-trained GraphCodeBERT encoder, which model semantic and data-flow information while preserving generalizability. This fusion enables efficient detection of machine-generated code across multiple programming languages (Python, C++, Java, and Go) and improves robustness under out-of-distribution settings. This feature-driven fusion offers a competitive, computation-efficient alternative to purely LLM-based fully fine-tuned models, achieving an F1-score of 38.26.
NUST PsyAI at SemEval-2026 Task 10: Parameter-Efficient RoBERTa for Conspiracy Detection and Character-Level Marker Extraction
Mian Muhammad Husnain Akram | Mehwish Fatima
We present the NUST PsyAI system for SemEval-2026 Task 10 (PsyCoMark), targeting document-level conspiracy detection and character-level psycholinguistic marker extraction from Reddit discourse. Our system ranks 7th in Extraction and 8th in Detection on the leaderboard. We benchmark feature-based and transformer approaches, adopting RoBERTa-large with LoRA for parameter-efficient fine-tuning. For detection, RB-DET-LoRA outperforms all baselines, achieving weighted F1 0.79 (dev) and 0.76 (test), with robust generalization under blinded evaluation. For extraction, we contrast a unified multi-type BIO scheme with a decomposed per-type setup; the latter mitigates cross-label interference and improves boundary consistency, reaching Overlap F1 of 0.16 (dev) and 0.21 (test). Results reveal a clear asymmetry: detection benefits from contextual semantic modeling, while extraction is limited by sparse supervision and boundary-sensitive evaluation.
YNWAAZ at SemEval-2026 Task 1: Bridging the Semantic-Visual Gap: Multimodal Humor Generation
Mohammad Erfan Zare | Tahere Abbasi | Hadi Veisi | Sayin Ala | Hanieh Naderi
Developing Computational Humor systems at a multilingual and multimodal scale requires transcending simple text generation paradigms to focus on intent and context understanding. In this study, we address two key limitations in Foundation Models: Association Failure in textual tasks, which prevents the formation of coherent semantic links between incongruous concepts, and Temporal Blindness in video processing, which disrupts narrative comprehension. To tackle these challenges, we propose a unified architecture comprising an Intent-Aware RAG system for mitigating linguistic gaps across English, Spanish, and Chinese, and a Cascaded Visual Perception pipeline for modeling the narrative structure of video content. A key innovation of this work is the utilization of small language models (TinyLlama) as a Semantic Denoise Filter, converting noisy visual signals into structured, coherent textual representations. Experimental results demonstrate that this modular architecture reduces cultural-semantic gaps in certain languages and produces outputs that generally align better with human humor preferences, though highly nuanced languages still present a challenge.
Stylometry at SemEval-2026 Task 13: Clustered Stylometric Modeling for Machine-Generated Code Detection
Sruthi Santhanam | Parthib Sarkar | Yashvardhan Sharma
Machine-generated code detection is examined under out-of-distribution conditions where robust generalization is required. A hybrid feature representation is used in which code snippets are encoded through character-level TF–IDF patterns together with explicit structural indicators capturing properties such as verbosity and formatting behavior. Variability across generators is handled through clustering-based expert specialization, and predictions are produced using an ensemble of logistic regression and Naïve Bayes models with calibrated thresholds. Experimental results show that the proposed approach performs competitively despite relying on simple linear classifiers. The findings suggest that persistent structural patterns in code provide reliable cross-domain signals for identifying machine-generated programs.
JCT at SemEval-2026 Task 8: Resource-Efficient Multi-Turn RAG via Nano-LLM Rewriting and Hybrid Reranking
Tal Farhan | Chaya Liebeskind
This paper describes our system submission for SemEval-2026 Task 8 (MTRAGEval), focusing on multi-turn Retrieval-Augmented Generation (RAG). Conversational queries often suffer from contextual ambiguity, rendering standard retrieval methods ineffective. We propose a highly resource-efficient pipeline that decouples query understanding from retrieval using a 1.5B parameter Nano-LLM (Qwen) for query rewriting, followed by parallel hybrid retrieval (Qdrant) and Cross-Encoder reranking. During internal development, our optimized system achieved an nDCG@5 score of 0.1991 on answerable queries, outperforming the official BM25 baseline. On the official blind test set, the system achieved a score of 0.1744. While our absolute performance trails behind baselines utilizing massive 20B parameter models, our work establishes a crucial baseline for extreme resource efficiency in conversational RAG. We provide a comprehensive error analysis detailing the impact of domain shifts and retrieval funnels, and we conduct a qualitative analysis of the organizers’ surprise “Underspecified” class to highlight the vulnerabilities of generative query rewriting.
JIA at SemEval-2026 Task 10: A Dual-Track System with BERT-based Encoders and LLMs for Conspiracy Analysis
Jiayue Zhu
This paper presents a dual-track system for conspiracy theory detection and psycholinguistic marker extraction. We evaluate multiple architectures, including DistilBERT, BERT-Base, DeBERTa-V3, RoBERTa, and instruction-tuned Qwen2.5 models. Qwen2.5-14B (full-shot) achieves the best performance with a Weighted F1-score of 0.80 in the detection task. Marker extraction remains challenging: while the fine-tuned LLM performs best on "Actors," its limited generalization in categories such as "Evidence" and "Effect" highlights persistent semantic ambiguity.
AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations
Dimosthenis Athanasiou | Maria Lymperaiou | Giorgos Filandrianos | Athanasios Voulodimos | Giorgos Stamou
We describe the AILS-NTUA system for SemEval-2026 Task 8 (MTRAGEval), addressing all three subtasks of multi-turn retrieval-augmented generation: passage retrieval (A), reference-grounded response generation (B), and end-to-end RAG (C). Our approach is based on two main design principles. First, we adopt a query-diversity-over-retriever-diversity strategy, where multiple complementary LLM-based query reformulations are issued to a single corpus-aligned sparse retriever and combined using a variance-aware nested Reciprocal Rank Fusion scheme. Second, we employ an agentic generation pipeline that decomposes grounded response generation into evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. The proposed system achieves strong performance across subtasks, ranking first in Task A and second in Task B in the official evaluation. Our empirical findings indicate that query diversity over a well-aligned retriever is more effective than heterogeneous retriever ensembling, and that answerability calibration, rather than retrieval coverage, emerges as the primary bottleneck in end-to-end performance.
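For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the plain, non-nested form: each document scores the sum of 1 / (k + rank) over the rankings that contain it. It is not the variance-aware nested variant used by the system, whose details the abstract does not give. The constant k = 60 is the commonly used default, and the function name and document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    rankings: list of lists, each ordered from most to least relevant.
    Returns document IDs sorted by descending fused score (ties broken by ID).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for three reformulations of one conversational query.
fused = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
])
```

Documents that appear near the top of several reformulation-specific rankings rise in the fused list, even if no single ranking places them first.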
UNED at SemEval-2026 Task 9: Sentiment-Aware Transformer Models with Back-Translation Augmentation for Online Polarisation Detection
Victor Garcia Sanabria | Alvaro Rodrigo | Roberto Centeno
This paper describes our submission to SemEval-2026 Task 9 (Subtask 1) on Spanish online polarisation detection. We investigate whether sentiment-adapted pretrained language models provide an advantage over general-purpose multilingual models for binary polarisation classification. Under a controlled training setup, we compare a base XLM-RoBERTa model, an emotion-adapted model, and a sentiment-adapted XLM-R model trained on Twitter data. To mitigate overfitting in the relatively small training dataset, we additionally apply back-translation as a data augmentation strategy. Experimental results show that the sentiment-adapted checkpoint consistently outperforms the alternative pretrained models under identical conditions. When combined with back-translation augmentation, the final system achieves a macro-averaged F1 score of 0.743 on the preliminary competition leaderboard. These findings suggest that prior adaptation to affective signals in social media can provide beneficial inductive bias for polarisation detection.
HyperparameterOmens at SemEval-2026 Task 13: Various approaches to detecting machine-generated code
Dmitry Sukhotin | How Yu
We present our systems for SemEval-2026 Task 13, built on the Droid resource suite and benchmark setting. For Subtask A (binary classification of human-written vs. machine-generated code), lexical baselines such as TF–IDF and character n-grams transferred poorly from the LeetCode training distribution to the production-code evaluation split. After correcting pipeline errors that obscured true performance and selecting stable AST features under domain shift, our final system uses 5 uncorrelated features and achieves 0.57 macro F1 on the public test set. For Subtask C (4-way authorship classification of human, AI, hybrid, and adversarial), lexical baselines performed poorly under a significant vocabulary shift. Deep semantic models proved more promising, and a per-class weighted ensemble that included these models achieved 0.57 macro F1 on the public test set.
Unibuc-NLP at SemEval-2026 Task 10: Unmasking Conspiracies with Pre-Trained Language Models
Teodor-George Marchitan | Liviu Dinu
The paper describes the system submitted to SemEval-2026 Task 10 (PsyCoMark) Subtask 2: detecting whether a Reddit comment expresses a conspiracy belief. We investigate three modeling paradigms: (A) an embedding-and-classify pipeline using Jina-embeddings-v3, HateBERT and BERT-Sentiment with Optuna-tuned classical ML models, optionally enriched by 19 readability features from textstat; (B) end-to-end fine-tuning of encoder transformers (DeBERTa-v3-base, DistilBERT) with a compact 128-unit classifier head and multiple pooling strategies; and (C) parameter-efficient QLoRA fine-tuning of large decoder-only models (Mistral-7B-v0.3, Qwen3-0.6B). Our best system, DeBERTa-v3-base with a 128-dimensional classifier, achieves a weighted F1 of 0.74, ranking 29/52 on the official leaderboard. Post-submission analysis further reveals that a weighted pooling strategy outperforms [CLS] on the official validation split by +0.04, achieving a weighted F1 of 0.78 (rank 8/52), suggesting that conspiracy-relevant features are distributed across transformer layers rather than concentrated at the final output.
Team BOBW (Best Of Both Worlds) at SemEval-2026 Task 3: Modular Cross-Attention Encoders for Dimensional Aspect-Based Sentiment Analysis
Michal Rynowiecki | Rob Van Der Goot
This paper presents our system for SemEval-2026 Task 3, which identifies four-part opinion details in product reviews. We used a sequence of pairs of BERT encoder models connected by cross-attention layers. The cross-attention mechanism provided only marginally better results than a self-attention equivalent, failing to showcase a significant improvement. Error propagation through the pipeline hurt the correctness of the outputs, with certain stages collapsing the scores. The pipeline architecture’s performance was largely independent of model size, suggesting that small modular encoders for downstream tasks are an efficient alternative to large decoder models. Our best model achieved a cF1 score of 0.53 on restaurant data and 0.26 on laptop data.
PolarizedTeam at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Maria Nestor | Maroan Al Shrafat | Ioana Pește | Daniela Gifu | Diana Trandabăț
This paper presents the systems developed for SemEval-2026 Task 9, which targets the detection and categorization of multilingual, multicultural, and multi-event online polarization across 22 languages. To address the challenges posed by linguistic diversity and short, heterogeneous texts, we evaluate several Transformer-based architectures for multilingual polarization detection. Our approach models the task as a multi-label classification problem and incorporates mean pooling for sentence representation, focal loss to mitigate severe label imbalance, and label-wise attention mechanisms to capture polarization-specific linguistic cues. Experimental results show that combining robust multilingual encoders with label-aware modelling substantially improves the detection of polarized content across diverse communities and events.
MKJ at SemEval-2026 Task 9: A Comparative Study of Generalist, Specialist, and Ensemble Strategies for Multilingual Polarization
Maziar Kianimoghadam Jouneghani
We present a systematic study of multilingual polarization detection across 22 languages for SemEval-2026 Task 9 (Subtask 1), contrasting multilingual generalists with language-specific specialists and hybrid ensembles. While a standard generalist like XLM-RoBERTa suffices when its tokenizer aligns with the target text, it may struggle with distinct scripts (e.g., Khmer, Odia) where monolingual specialists yield significant gains. Rather than enforcing a single universal architecture, we adopt a language-adaptive selection strategy that chooses among multilingual generalists, language-specific specialists, and hybrid ensembles based on development performance. Additionally, cross-lingual augmentation via NLLB-200 yielded mixed results, often underperforming native architecture selection and degrading morphologically rich tracks. Our final system achieves an overall macro-averaged F1 score of 0.796 and an average accuracy of 0.826 across all 22 tracks. Code and final test predictions are publicly available at: https://github.com/Maziarkiani/SemEval2026-Task9-Subtask1-Polarization.
Proofbusters at SemEval-2026 Task 11: Neuro-Symbolic Syllogistic Reasoning via LLM-Guided Structure Extraction and Deterministic Validation
Mohamed Ayman | Khaled Marzouk | Abdallah Mashaly | Ahmed Heriez
This paper presents the **Proofbusters** system for SemEval-2026 Task 11 (English syllogism validity classification). The task evaluates whether language models can perform *formal* syllogistic reasoning independent of semantic content, i.e., without being swayed by *belief bias* (judging arguments by plausibility or world knowledge instead of logical validity). The main idea is **symbolic abstraction**: before predicting validity, each syllogism is converted into a content-invariant logical form so the model reasons over structure rather than over concrete terms. Inspired by Euler’s abstraction in the Königsberg bridges problem (stripping away geography to reveal pure structure), the paper explores three abstraction strategies of increasing formal rigor: 1. **Template abstraction**: replace categorical terms with generic placeholders (e.g., x, y, z); keep syntax and quantifiers. Serves as a baseline (82.20% accuracy). 2. **Symbolic OOP abstraction**: map entities and relations into an object-oriented constraint graph with explicit tracking of supersets, disjoint sets, etc. (88.84% with Qwen-7B). 3. **Set-theoretic abstraction**: translate premises and conclusion into formal set notation (e.g., \(A \subseteq B\), \(A \cap B = \emptyset\)) and enforce *existential import* (\(A, B, C \neq \emptyset\)) to align with Aristotelian logic. The solver never sees the original natural-language terms. The system uses a **two-stage pipeline**: a **Formulation** stage (natural language → symbolic representation) and a **Solver** stage (validity judgment from symbols only).
The set-theoretic variant, using Gemini Flash 2.5 for formulation and Gemini Pro 2.5 for solving, achieves **98.95% accuracy** with **2.13** total content effect (TCE) and an **overall score of 46.23**, substantially outperforming both task baselines and the other abstraction variants. The **conclusion** is that belief bias in LLMs is tied to semantic surface form: *explicit abstraction into mathematical set notation* sharply reduces plausibility-driven errors. Robust logical reasoning likely requires **architectural separation** between semantic parsing and formal inference, rather than prompt engineering alone. Remaining challenges include formulation errors (e.g., quantifier misclassification), multi-step constraint composition, and negation–inclusion interactions. Future work may combine the abstraction pipeline with formally verified theorem provers and extend it to multilingual or more complex multi-premise reasoning.
VGU-M.Tech-AI at SemEval-2026: Multilingual Multi-Label Classification of Online Polarization Types via Weighted Transformer Fine-Tuning and Adaptive Per-Label Threshold Optimization
Abdulkadir Bichi | Jyoti Shekhawat
This paper proposes MMCOPT, a multilingual multi-label classifier of online polarization types built on weighted transformer fine-tuning and adaptive per-label threshold optimization. The task is to classify social media posts according to a given set of five labels: a post may be politically, racially, religiously, or gender/sexually polarizing, or fall into the category of other. We use a distilbert-base-multilingual-cased model and attach a two-layer MLP head. We also use a class-imbalance-weighted binary cross-entropy loss and optimize thresholds for each class to improve the validation micro-F1 score. Our training set is drawn from the POLAR benchmark, the first large multilingual polarization dataset that includes posts from seven languages and multiple social media platforms. MMCOPT’s best internal validation micro-F1 score is 0.7855, and its macro-F1 score is 0.7749. Our model (team username: asbichi362) is ranked on the official Codabench leaderboard and shows competitive results across the 22 language tracks of multilingual polarization type classification, with its best results in Hindi (0.7429) and Urdu (0.7073).
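Per-label threshold optimization can be sketched as a small grid search on validation data. The version below is a simplified illustration that maximizes each label's own F1 independently, which is not necessarily the exact micro-F1 procedure used by the authors; all function names, the candidate grid, and the toy data are assumptions.

```python
def f1(y_true, y_pred):
    """Binary F1 for 0/1 labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def best_threshold(y_true, probs, candidates):
    """Return (threshold, F1) for the candidate with the highest validation F1.

    Ties keep the earliest candidate in the list.
    """
    best_t, best_score = None, -1.0
    for t in candidates:
        preds = [1 if p >= t else 0 for p in probs]
        score = f1(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


def tune_per_label(y_true_by_label, probs_by_label, candidates):
    """Pick one decision threshold per label from validation probabilities."""
    return {
        label: best_threshold(y_true_by_label[label], probs_by_label[label], candidates)[0]
        for label in y_true_by_label
    }


# Toy validation data for a single hypothetical label.
thresholds = tune_per_label(
    {"political": [1, 1, 0, 0]},
    {"political": [0.9, 0.4, 0.35, 0.1]},
    candidates=[0.3, 0.5],
)
```

A rare label often ends up with a threshold below 0.5, which trades some precision for recall.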
Sylloscope at SemEval-2026 Task 11: Decoupling Logic from Belief via DeepSeek-Enhanced Distillation in Qwen Models
Zhanyu Chen | María Teresa Muñoz Martín | Sem Huisman | Jingjing Lan
This paper presents our approach for SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models. We propose a neuro-symbolic teacher-student framework that utilizes DeepSeek-R1 as a Logical Auditor to generate a high-fidelity training corpus. We distill these analytical behaviors into Qwen-3 models using Low Rank Adaptation (LoRA), focusing on teaching the mechanics of logic rather than simple label matching. Our system yields robust results across both subtasks, with a ranking score of 39.81 (96.86% accuracy) on Subtask 1 and 26.02 on Subtask 3. However, validity bias partially persists, so we conclude that while structured distillation substantially mitigates belief bias, fully disentangling logical validity from plausibility remains a central challenge for future development.
VerbaNexAI at SemEval-2026 Task 6: Automatic Detection of Political Evasion through Hierarchical Classification with RoBERTa Large
Jeison Jimenez Alvear | Deyson Gómez Sánchez | Juan Carlos Martinez Santos | Edwin Puertas | Jairo Serrano
This paper describes VerbaNex AI’s participation in SemEval-2026 Task 6: CLARITY, a shared task on automatic detection of question evasion in political interview transcripts. The task requires classifying question-answer pairs into three clarity levels (Task 1) and identifying nine evasion techniques (Task 2). We propose and evaluate two independent systems based on RoBERTa-Large. The first is a standard sequence classifier that treats each question-answer pair as a single input sequence, leveraging RoBERTa’s native two-segment encoding to model the relationship between the two texts jointly. The second is a dual-encoder architecture that processes the question and answer independently and computes geometric interaction features to model the semantic misalignment between them explicitly. Both systems are trained on Task 2 labels and derive Task 1 predictions via the hierarchical mapping proposed by the task organizers. Our best result was achieved by the standard sequence classifier, reaching Rank 10 on Task 2 and Rank 25 on Task 1.
pamaldi at SemEval-2026 Task 11: Neuro-Symbolic Syllogistic Reasoning via LLM-Guided Structure Extraction and Deterministic Validation
Pasquale Grimaldi
We describe our participation in SemEval-2026 Task 11, Subtask 1: determining the formal validity of syllogisms in English while minimizing the influence of content plausibility. Our system implements a neuro-symbolic pipeline that strictly separates neural and symbolic components. An LLM extracts the formal structure of natural-language syllogisms — proposition types (A, E, I, O) and the three terms — while the syllogistic figure is computed deterministically and a symbolic validator checks whether the resulting mood–figure pair belongs to the 24 classically valid Aristotelian forms. On the official evaluation we achieve 96.34% accuracy, Total Content Effect (TCE) of 1.02, and combined score of 56.57. Compared to pure-LLM baselines on the same backbone, our system more than doubles the combined score (from 26.52 to 56.57) and reduces TCE by nearly an order of magnitude. Swapping the extractor to Claude Sonnet 4.5 preserves combined score and TCE, confirming that content-invariance is contributed by the symbolic stage rather than any particular LLM. A paraphrase probe reveals that the validator is invariant to surface form but the extractor is sensitive to premise ordering — a specific, fixable limitation we identify as the primary target for future work.
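The symbolic stage described above can be illustrated with a short sketch. It assumes the extractor has already produced each proposition as a (type, subject, predicate) tuple, then computes the figure deterministically and looks the mood up among the 24 classically valid forms. The tuple format and function names are assumptions for illustration; the authors' implementation may differ.

```python
# The 24 classically valid mood-figure pairs (assuming existential import).
VALID_FORMS = {
    1: {"AAA", "EAE", "AII", "EIO", "AAI", "EAO"},
    2: {"EAE", "AEE", "EIO", "AOO", "EAO", "AEO"},
    3: {"AAI", "IAI", "AII", "EAO", "OAO", "EIO"},
    4: {"AAI", "AEE", "IAI", "EAO", "EIO", "AEO"},
}


def mood_and_figure(premise_1, premise_2, conclusion):
    """Return (mood, figure) for propositions given as (type, subject, predicate)."""
    c_type, minor_term, major_term = conclusion
    # The major premise is the one that mentions the conclusion's predicate.
    if major_term in premise_1[1:]:
        major, minor = premise_1, premise_2
    else:
        major, minor = premise_2, premise_1
    middle = set(major[1:]) - {major_term}
    if len(middle) != 1 or set(minor[1:]) != middle | {minor_term}:
        raise ValueError("propositions do not share exactly three terms")
    (m,) = middle
    # The figure depends on whether the middle term is the subject of each premise.
    figure = {
        (True, False): 1,   # M-P, S-M
        (False, False): 2,  # P-M, S-M
        (True, True): 3,    # M-P, M-S
        (False, True): 4,   # P-M, M-S
    }[(major[1] == m, minor[1] == m)]
    return major[0] + minor[0] + c_type, figure


def is_valid(premise_1, premise_2, conclusion):
    """True if the mood-figure pair is one of the 24 classically valid forms."""
    mood, figure = mood_and_figure(premise_1, premise_2, conclusion)
    return mood in VALID_FORMS[figure]


# "All mammals are animals; all dogs are mammals; so all dogs are animals."
barbara = is_valid(
    ("A", "mammal", "animal"), ("A", "dog", "mammal"), ("A", "dog", "animal")
)
```

Because the major premise is identified from the conclusion's predicate, this particular sketch gives the same answer when the two premises are swapped.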
COODetect at SemEval-2026 Task 13: Unsupervised Latent Domain Adaptation for Out-of-Distribution AI Code Detection
Aldan Creo | Atharv Nair | Mohana Ravikumar | Vaishak Menon | Dario Wisznewer | Vaibhav Jain
The widespread use of AI-generated code raises questions about software maintenance and academic integrity. However, tools to detect it are still in their infancy. In this article, we explore the issue of out-of-distribution (OOD) detection; while embedder models like CodeBERT can easily achieve high accuracies in the context of their training data, they are unable to properly generalize to unseen contexts or programming languages. We argue that this is caused by an overfitting of such models to the training distribution, e.g. memorizing a language’s "AI syntax" instead of the true generative artifacts, and develop an approach that is able to naturally generalize to completely unseen languages and domains. Our system is also considerably more interpretable than the deep neural alternatives. In particular, we propose three orthogonal views (lexical, structural, and symbolic) to capture the AI-generated code’s indicators. To deal with OOD shift, we normalize the scores per language with Z-scoring and a Gaussian Mixture Model to remove the language bias automatically. We test our approach on the SemEval-2026 Task 13 dataset, where our experiments reached a macro F1 of 0.602 compared to the task baseline of 0.305, demonstrating the generalization capabilities of our system. We make our source code and data available at https://github.com/ACMCMC/COODetect.
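The per-language normalization step can be sketched as follows. This covers only the Z-scoring part (the Gaussian Mixture Model step is omitted), and the input format, function name, and toy scores are assumptions rather than the authors' code.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_per_language(samples):
    """Standardize raw detector scores within each programming language.

    samples: list of (language, raw_score) pairs.
    Returns z-scores in the same order, computed with each language's own
    mean and population standard deviation (0.0 if a language has no spread).
    """
    by_lang = defaultdict(list)
    for lang, score in samples:
        by_lang[lang].append(score)
    stats = {lang: (mean(v), pstdev(v)) for lang, v in by_lang.items()}
    normalized = []
    for lang, score in samples:
        mu, sigma = stats[lang]
        normalized.append(0.0 if sigma == 0 else (score - mu) / sigma)
    return normalized


# Toy scores: "go" snippets score ten times higher than "py" ones on the raw scale.
z = zscore_per_language([("py", 1.0), ("py", 3.0), ("go", 10.0), ("go", 30.0)])
```

After normalization both languages share the same scale, so a single decision threshold no longer favors one language's raw score range.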
NCL HKU-NarrSim at SemEval-2026 Task 4: Aspect-Based Agents and Supervised Contrastive Embeddings for Narrative Similarity
Jianfei Xu | Ting Zhu | Mingyang Chen | Huizhi(elly) Liang
SemEval-2026 Task 4 on Narrative Similarity requires models to assess narrative alignment between stories rather than relying on surface lexical similarity. For Track A, we introduce the Aspect-Based Narrative Similarity Agents (ABNS-Agents), a two-stage agent-based framework. It extracts three core narrative aspects aligned with the task definition under a schema constraint, and then performs aspect-aligned similarity adjudication using an LLM decision model. For Track B, Narrative Supervised Contrastive Embeddings (NSConE) is based upon supervised contrastive learning to model narrative similarity. Our experiments show that ABNS-Agents achieves 70.25% accuracy on the test set, while NSConE reaches 68.5% test accuracy, demonstrating competitive performance across both reasoning-based and representation-learning paradigms. The findings highlight the effectiveness of aspect-aligned structured modelling and task-specific supervised contrastive learning for capturing narrative similarity beyond surface semantics.
ILab-NLP at SemEval-2026 Task 9: Comparing XLM-RoBERTa and LLaMA-2 for Multilingual Polarization Detection
Declan Booth | Gavin Abercrombie | Simona Frenda
This submission describes a system for SemEval-2026 Task 9, Subtask 1, focused on binary detection of polarized versus non-polarized posts in English and Spanish. We compare two approaches: a fine-tuned multilingual encoder model (XLM-RoBERTa) and a prompted generative model (LLaMA-2 7B). Our experiments show that XLM-RoBERTa delivers stronger and more stable performance overall, while LLaMA-2 is more prone to false positives in Spanish due to a strong bias toward predicting the polarized class. In addition to headline results, we analyse model behaviour using confidence signals and SHAP, and report efficiency measurements with CodeCarbon to highlight practical tradeoffs between performance and computational cost.
VerbaNexAI at SemEval-2026 Task 5: Few-Shot Chain-of-Thought with Selective Self-Consistency and Isotonic Calibration for Word Sense Plausibility Rating
Daniel Peña Gnecco | Edwin Puertas | Juan Carlos Martinez Santos | Jairo Serrano
We present a system for rating word sense plausibility in ambiguous narrative contexts for SemEval-2026 Task 5. Our approach ensembles three large language models (Llama-3.1 70B, Qwen-2.5 32B, and Gemma-2 27B) using a computationally efficient, uncertainty-aware pipeline. We combine few-shot chain-of-thought prompting with selective self-consistency, which applies stochastic multiple sampling exclusively to items identified as inherently ambiguous. This targeted strategy reduces inference costs by approximately 45% while maintaining robustness in predictions. To correct the systematic bias of LLMs toward extreme ratings, we apply isotonic regression to shift the output distribution toward patterns of human judgment. Our system achieves a Spearman correlation of 0.67 and an accuracy within 0.76 standard deviations, ranking 34th out of 79 participating teams (top 43% without task-specific fine-tuning). Detailed error analysis reveals that while our system performs strongly on clear contexts (ρ = 0.78), current prompting paradigms struggle significantly to model multimodal human disagreement in genuinely ambiguous cases (ρ = 0.58), highlighting an important challenge for future work on subjective semantic tasks.
NCL at SemEval-2026 Task 8: Deterministic Small-LLM RAG with Relation Classification
Zehao Liu | Huizhi Liang
We present NCL’s system for SemEval-2026 Task 8B, the generation track for multi-turn retrieval-augmented dialogues. Our submission follows a compact and reproducible RAG pipeline: (1) global and local question rewriting with LLM-based multi-turn relation control, (2) passage reranking with BGE-M3, (3) context-level answerability filtering with strict binary LLM judgments (“yes”/“no”), and (4) deterministic inference with a small LLM (Qwen2.5-1.5B-Instruct) plus post-generation quality fallback (cleaning, bad-answer gate, one stricter retry, then an IDK fallback). On the official test set, our system achieved a harmonic mean score of 0.5973 (RB$_{agg}$ 0.4993, RL$_F$ 0.7235, RB$_{llm}$ 0.6105), ranking 19th out of 26 teams on the leaderboard.
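As a quick consistency check on the reported numbers, the sketch below computes the harmonic mean of the three component scores quoted in the abstract. That the leaderboard score is exactly the unweighted harmonic mean of these three values is an assumption, though the arithmetic agrees with the reported 0.5973. The dictionary keys are illustrative labels only.

```python
from statistics import harmonic_mean

# Component scores as quoted in the abstract.
component_scores = {"RB_agg": 0.4993, "RL_F": 0.7235, "RB_llm": 0.6105}

# Unweighted harmonic mean: n / sum(1 / x_i); it is pulled toward the lowest score.
overall = harmonic_mean(list(component_scores.values()))
overall_rounded = round(overall, 4)
```

Because the harmonic mean penalizes the weakest component, the overall score sits closer to the lowest of the three values than an arithmetic mean would.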
SCUMesclab at SemEval-2026 Task 3: An Adaptive Dual-Track Framework for Dimensional Aspect-Based Sentiment Analysis
Chia-Yun Lee | Matus Pleva | Daniel Hladek | Ming-Hsiang Su
This paper describes our system for SemEval-2026 Task 3, which focuses on predicting continuous valence and arousal scores. The task poses significant challenges due to variations in data scale and pragmatic ambiguities across languages. To address these disparities, we propose an Adaptive Dual-Track Framework that dynamically selects modeling strategies based on task characteristics. For semantically stable tasks, we apply a robust single baseline optimized with layer-wise learning rate decay (LLRD) to ensure stability. For high-ambiguity scenarios such as the Environmental Protection domain, we adopt a heterogeneous ensemble strategy to mitigate prediction variance. Experimental results demonstrate that our system consistently outperforms the initial standard baseline across all subtasks. Furthermore, our lightweight approach exhibits remarkable parameter efficiency, achieving highly competitive performance against newly introduced large language model (LLM) baselines. Additionally, ablation studies reveal that under regression settings, conventional regularization techniques, cross-lingual data transfer, and homogeneous ensemble learning can lead to negative transfer, confirming the necessity of strategically diverging approaches tailored to linguistic characteristics.
PAI at SemEval-2026 Task 3: An LLM and Data Redistribution Adaptation-Based Predictive Strategy for Valence-Arousal Scores
Zhihao Ruan | Kaifeng Yang | Cheng Chen | Wenwen Dai | Wenjia Mao
To address the valence and arousal score prediction task in Dimensional Aspect-Based Sentiment Analysis (DimABSA), we propose a two-stage strategy. In the first stage, we conduct post-training on a Large Language Model (LLM) via a Supervised Fine-Tuning (SFT) scheme, followed by generating initial predictions for valence and arousal scores. In the second stage, we perform distribution adaptation on the initial results by leveraging the training set distribution through various techniques, including Gaussian distribution modeling, quantile mapping, and the Sinkhorn algorithm.
UCSC NLP at SemEval-2026 Task 10: Boundary-Aware Span Extraction and RoBERTa Classification for Conspiracy Detection
Dom Marhoefer | Milos Suvakovic | Glenn Grant-Richards | Aidan Pinero | Ryan King
We present our systems for SemEval-2026 Task 10 (PsyCoMark), addressing conspiracy marker extraction (Subtask 1) and document-level conspiracy detection (Subtask 2). For marker extraction, we formulate the task as multi-label span classification over enumerated candidate spans, using IoU ≥ 0.95 positive labeling, hard-negative sampling, and containment-based non-maximum suppression (NMS) with boundary-aware span representations. Document classification is modeled independently using a sequence classifier with label smoothing and a stratified train–validation split. Analysis shows that entity-like roles (Actor, Victim) are detected robustly, while abstract roles (Action, Effect, Evidence) remain sensitive to boundary criteria. On the official test set, our systems rank 7th in Subtask 1 (0.2251 macro F1) and 12th in Subtask 2 (0.7694 weighted F1).
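The IoU-based positive labeling mentioned above can be sketched as follows. This is a minimal illustration, assuming half-open `(start, end)` offsets; whether the system uses token or character offsets is not stated, and the function names are hypothetical.

```python
def span_iou(a, b):
    """Intersection over union of two half-open spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def label_candidates(candidates, gold_spans, threshold=0.95):
    """Mark a candidate positive if it matches some gold span at IoU >= threshold."""
    return [
        any(span_iou(cand, gold) >= threshold for gold in gold_spans)
        for cand in candidates
    ]


gold = [(0, 40)]                          # one annotated span (toy example)
candidates = [(0, 40), (0, 39), (0, 30), (50, 60)]
labels = label_candidates(candidates, gold)
```

With a threshold this strict, a candidate that drops a quarter of the gold span (IoU 0.75) is treated as a negative, which is consistent with the abstract's remark that abstract roles are sensitive to boundary criteria.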
XplaiNLP at SemEval-2026 Task 1: BVAHAHA - Benign Violation Algorithm for Humor and Harmless Absurdity
Berk Bubus | Nebi Soyal | Vera Schmitt | Nils Feldhus | Veronika Solopova
We present BVAHAHA, a humor generation system for SemEval-2026 Task 1 (MWAHAHA Subtask A), which frames constrained joke generation through the lens of Benign Violation Theory (BVT). Given either two rare words or a news headline, the system generates contextually appropriate jokes while avoiding memorization and unsafe outputs. Our approach combines BVT-guided humor generation with a parallel moderation pipeline ("Gatekeepers") that detects excessive emotional intensity and hate speech, triggering iterative revisions when necessary. Finally, we employ an LLM-as-a-Judge framework with persona-based ranking to approximate human humor preferences.
looploop at SemEval-2026 Task 3: A Dimensional Aspect-Based Sentiment System with DeBERTa Regression and Qwen3 Instruction Fine-Tuning
Liu Yang | Gang Hu | Jing Li
Aspect-Based Sentiment Analysis (ABSA) has evolved to capture continuous affective states, posing challenges for traditional classification models. We adopt a hybrid approach tailored to the varying complexities of the subtasks. For Task 1 (Valence-Arousal Regression), we employ a discriminative architecture using a pre-trained DeBERTa encoder with a MeanPooling mechanism to directly regress continuous sentiment scores. For Tasks 2 and 3, which require complex structural extraction of opinion triplets and quadruplets, we utilize a generative approach by fine-tuning the Qwen3-4B-Instruct large language model via 4-bit QLoRA. Our system effectively handles both precise numerical regression and complex structural text generation, achieving competitive results across the English laptop and restaurant domains.
PFW at SemEval-2026 Task 6: Multi-Seed DeBERTa Ensembles for Political Response Clarity and Evasion Classification
Taleef Tamsal
This paper describes the PFW system for SemEval-2026 Task 6 (CLARITY), which addresses the classification of response clarity and evasion techniques in political interview question-answer pairs. Rather than relying on large language model prompting, we pursue a competitive non-LLM approach based on fine-tuning DeBERTa-xlarge and DeBERTa-v3-large with a multi-seed ensemble strategy: 5-fold cross-validation with 10 random seeds yields 50 models per architecture, combined through simple logit averaging. Our system achieves a macro F1 of 0.76 on Subtask 1 (clarity-level classification) and 0.50 on Subtask 2 (evasion-type classification). We also find that three post-hoc optimization techniques (learned ensemble weights, threshold calibration, and hierarchical masking) each improve out-of-fold performance yet degrade evaluation scores by 0.02–0.10 F1. This pattern should be interpreted cautiously: the 237-sample evaluation set likely contributes substantial variance, and two of the three degradations fall within the ±0.06 95% CI expected from sampling noise. Still, the consistent directional pattern across all three prediction-level interventions provides suggestive evidence for an optimization paradox, highlighting the risk of overfitting to cross-validation predictions when evaluation data is limited. Our code is publicly available at https://github.com/Taleef7/semeval-2026-task6.
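The "simple logit averaging" used to combine ensemble members can be sketched as below. This is a minimal illustration with made-up logits for three models and three classes; it is not taken from the authors' released code, and the function names are hypothetical.

```python
def average_logits(per_model_logits):
    """Average class logits across models for a single example."""
    n_models = len(per_model_logits)
    n_classes = len(per_model_logits[0])
    return [
        sum(model[c] for model in per_model_logits) / n_models
        for c in range(n_classes)
    ]


def predict(per_model_logits):
    """Return the index of the class with the highest averaged logit."""
    avg = average_logits(per_model_logits)
    return max(range(len(avg)), key=avg.__getitem__)


# Toy logits from three hypothetical ensemble members.
logits = [
    [2.0, 0.5, -1.0],
    [0.2, 1.1, -0.3],
    [1.5, 1.4, 0.0],
]
prediction = predict(logits)
```

Uniform averaging has no parameters to fit, which is one reason it may be less prone to the overfitting that the abstract reports for learned ensemble weights.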
PFW Task 8 at SemEval-2026 Task 8: Lightweight Tri-Fusion Retrieval with Prompt-Engineered Faithful Generation for Multi-Turn RAG
Taleef Tamsal
We describe PFW Task 8’s system for SemEval-2026 Task 8 (MTRAGEval), a benchmark for multi-turn retrieval-augmented generation across four English-language corpora. Our submission combines BM25, SPLADE-v3, and Jina Embeddings v4 with weighted reciprocal rank fusion for retrieval, plus zero-shot GPT-4o/GPT-4o-mini prompting for generation. Officially, our system ranks 6th of 26 on Task B (H = 0.756), 14th of 29 on Task C (H = 0.533), and 20th of 38 on Task A (nDCG@5 = 0.433). For the camera-ready analysis, we re-run retrieval at the official nDCG@5 cutoff, strengthen the prompt ablation with per-domain statistics and exact tests, and analyze official outputs by answerability and domain. On a balanced 100-example development sample, explicit citation-format instructions (not chain-of-thought alone) raise citation use from 4% to 93%, and a fixed-context Task C control improves from H = 0.463 with GPT-4o-mini to H = 0.523 with GPT-4o. Official analytics also show near-perfect UNANSWERABLE handling (H = 0.990) but weak behavior on UNDERSPECIFIED turns, where the system answers or refuses instead of clarifying. Our code is publicly available.
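Weighted reciprocal rank fusion, as named above, can be sketched in a few lines. The smoothing constant `k=60`, the equal weights, and the toy document IDs are assumptions for illustration; the abstract does not report the values used.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists: score(d) = sum over retrievers of w / (k + rank)."""
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        weight = weights[name]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy ranked lists from three retrievers (hypothetical document ids).
rankings = {
    "bm25": ["d1", "d2", "d3"],
    "splade": ["d2", "d1", "d4"],
    "dense": ["d2", "d3", "d1"],
}
weights = {"bm25": 1.0, "splade": 1.0, "dense": 1.0}  # hypothetical weights

fused = weighted_rrf(rankings, weights)
```

Because the fusion uses only ranks, the three retrievers' raw scores never need to be normalized against each other.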
YangSteam at SemEval-2026 Task 3: Transformer-Based Aspect-Aware Regression for Dimensional Sentiment and Stance Analysis
Tsung-Hsien Yang | Shu-Fei Yang
This paper describes our system for the SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). We participate in Track A (DimABSA) and Track B (DimStance), both of which involve Subtask 1 – predicting continuous valence–arousal (VA) scores for given text–aspect pairs in English and Chinese. Our system combines pre-trained multilingual transformers with aspect-marker input encoding and dual regression heads for VA prediction, trained with a 5-fold cross-validation ensemble. We select XLM-RoBERTa-large as the backbone for Track A and mDeBERTa-v3-base for Track B based on systematic model comparison on the development sets. On the official test sets, our system substantially outperforms the organizer-provided baselines across all language-domain settings. On the unofficial post-evaluation leaderboard, the system achieves strong results on Chinese subsets, ranking 1st on zho-env (Track B) and 2nd on zho-fin (Track A).
PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation
Srikar Kashyap Pulipaka
We present our system for SemEval-2026 Task 9: Multilingual Polarization Detection, a binary classification task spanning 22 languages. Our approach fine-tunes separate Gemma 3 models (12B and 27B parameters) per language using Low-Rank Adaptation (LoRA), augmented with synthetic data generated by a large language model (LLM). We employ three synthetic data strategies (direct generation, paraphrasing, and contrastive pair creation) using GPT-4o-mini, with a multi-stage quality filtering pipeline including embedding-based deduplication. We find that per-language threshold tuning on the development set yields 2 to 4% F1 improvements without retraining. We also use weighted ensembles of 12B and 27B model predictions with per-language strategy selection. Our final system achieves a mean macro-F1 of 0.811 across all 22 languages, ranking 2nd overall out of 60 participating teams, with 1st place finishes in 2 languages and top-3 in 8 languages. We also find that alternative architectures (XLM-RoBERTa, Qwen3) that showed strong development set performance suffered 30 to 50% F1 drops on the test set, highlighting the importance of generalization.
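Per-language threshold tuning of the kind described above can be sketched as a grid search over decision thresholds on development predictions. The grid, the toy probabilities, and the function names are assumptions for illustration, not the authors' implementation.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the two classes of a binary task."""
    f1_scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def tune_threshold(probs, labels, grid=None):
    """Return the (threshold, macro-F1) pair that is best on the given dev data."""
    grid = grid or [i / 100 for i in range(5, 96)]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = [int(p >= t) for p in probs]
        score = macro_f1(labels, preds)
        if score > best_f1:  # keep the first threshold reaching the best score
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy development data for one language (hypothetical values).
dev_probs = [0.90, 0.40, 0.35, 0.10]
dev_labels = [1, 1, 0, 0]

default_f1 = macro_f1(dev_labels, [int(p >= 0.5) for p in dev_probs])
best_t, best_f1 = tune_threshold(dev_probs, dev_labels)
```

In a multilingual setting the same search would be run once per language, so each language gets its own threshold without any retraining.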
This paper describes Team SoloSemantics’ submissions to SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. We began with lightweight neuro-symbolic knowledge-graph baselines, but a triplet-tuned MPNet bi-encoder produced stronger semantic separation in our experiments. We adopted a shared dense encoder family across both tracks and kept the KG and fusion variants as diagnostic baselines. Team SoloSemantics ranked 22nd on Track A and 9th on Track B. Our reproducibility audit further shows that the KG branch was often too sparse on short summaries to represent abstract narrative relations reliably under the current extraction pipeline.
AsymVerify at SemEval-2026 Task 6: Asymmetric Confidence-Gated Verification for Political Evasion Detection
Sebastien Kawada
Political evasion is difficult to detect because evasive answers often appear cooperative while avoiding concrete commitment. We present AsymVerify, a confidence-gated verification system for SemEval-2026 Task 6, a three-way classification of Clear Reply, Ambivalent, and Clear Non-Reply responses. AsymVerify scored 0.85 Macro F1 on the evaluation split (Deval, n=237), placing 2nd out of 41 teams on the official leaderboard. The system first classifies each question-answer pair, then selectively applies downgrade verification (CR/CNR → AMB) or upgrade verification (AMB → CR) to low-confidence predictions. Development analysis shows that errors concentrate at the Ambivalent boundary in both directions, motivating this asymmetric two-verifier design while confidence gating keeps additional inference cost low. On Ddev (n=308), AsymVerify with GLM-4.7 gains +17.1 Macro F1 over single-pass classification at 1.48 calls/example, and the upgrade verifier alone improves every tested LLM backend on Ddev by +6.8 to +15.2 Macro F1 over its single-pass baseline. Code is available at https://github.com/kaons-research/AsymVerify-ACL.
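The confidence-gated, two-direction verification described above can be sketched as follows. The gate value, the function signature, and the verifier callables are hypothetical stand-ins; the released repository should be consulted for the actual design.

```python
def asym_verify(label, confidence, downgrade_ok, upgrade_ok, gate=0.8):
    """Apply at most one verifier call to a low-confidence first-pass label.

    downgrade_ok and upgrade_ok are hypothetical zero-argument callables that
    wrap an LLM verifier and return True when the label change is supported.
    """
    if confidence >= gate:
        return label  # confident prediction: no extra call
    if label in ("CR", "CNR"):
        # Downgrade verification: CR/CNR -> AMB
        return "AMB" if downgrade_ok() else label
    if label == "AMB":
        # Upgrade verification: AMB -> CR
        return "CR" if upgrade_ok() else label
    return label


# Toy usage with stubbed verifiers in place of LLM calls.
kept = asym_verify("CR", 0.95, lambda: True, lambda: True)
downgraded = asym_verify("CNR", 0.40, lambda: True, lambda: False)
upgraded = asym_verify("AMB", 0.40, lambda: False, lambda: True)
```

Only low-confidence items trigger a verifier, which is how the average cost can stay well below two calls per example.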
NYCU Speech Lab at SemEval-2026 Task 3: Heterogeneous Model Ensemble with Adaptive Weighted Voting for Dimensional Aspect Sentiment Quadruplet Extraction
Hao-Chun Hsieh | Cheng-En Wu | Yuan-Fu Liao
SemEval-2026 Task 3 (DimABSA) includes Dimensional Aspect Sentiment Quadruplet Extraction (DimASQP), which requires extracting structured tuples—aspect term, aspect category, and opinion term—together with continuous valence–arousal (VA) values from reviews (Yu et al., 2026a). In this work, we participate in Track A, Subtask 3. We describe NYCU Speech Lab’s submission for the Chinese Restaurant and Laptop domains. Our system is a post-processing ensemble over heterogeneous architectures: LoRA/QLoRA fine-tuned decoder-only LLMs, a fine-tuned encoder-only model, and (optionally) prompted API-based LLMs. To improve robustness under the continuous F1 (cF1) metric, we use validation-calibrated weighted voting for tuple selection and weighted VA fusion for numerical aggregation, with strict output validation to enforce task constraints. Experiments on a held-out validation split show consistent gains over single models and clarify the precision–recall trade-offs induced by the voting threshold. On the organizers’ released (tentative) test leaderboard snapshot, our submission ranks first in both domains.
CascadeMind at SemEval-2026 Task 4: A Hybrid Neuro-Symbolic Cascade for Narrative Similarity
Sebastien Kawada | Dylan Holyoak
Across self-consistency samples from an LLM, vote agreement tracks instance difficulty: on SemEval-2026 Task 4 (Narrative Story Similarity), supermajority cases (≥ 7/8 votes) resolve at 85% accuracy, split votes at 67%, and perfect ties at 61%, a monotone gradient that holds across the development set. We exploit this in CascadeMind, which routes eight Gemini 2.5 Flash votes by consensus, escalates split votes to additional sampling rounds, and falls through to a symbolic ensemble of theory-inspired narrative signals only on perfect ties (5% of cases). The system reached 72.75% on Track A test, placing 10th of 44 teams. Ablations show that the symbolic component contributes negligibly end-to-end and that nearly all gains come from confidence-aware routing. The takeaway is methodological: for narrative similarity, calibrating when to spend more compute on a hard instance matters more than adding auxiliary representations to reason about it. Code is available at https://github.com/chreia/CascadeMind-ACL.
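The routing rule described above (accept supermajorities, escalate split votes, fall through on perfect ties) can be sketched as below. This is a simplified single-round illustration; the route names and function are hypothetical, and the real system's additional sampling rounds are not modeled.

```python
from collections import Counter


def route(votes, supermajority=7 / 8):
    """Return (decision, route_name) for a list of votes such as 'A' / 'B'."""
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    second_count = counts[1][1] if len(counts) > 1 else 0
    if top_count == second_count:
        return None, "symbolic_fallback"  # perfect tie: defer to symbolic signals
    if top_count / len(votes) >= supermajority:
        return top_label, "accept"        # supermajority: trust the vote
    return top_label, "escalate"          # split vote: request more samples


clear = route(["A"] * 7 + ["B"])
split = route(["A"] * 5 + ["B"] * 3)
tie = route(["A"] * 4 + ["B"] * 4)
```

The design point is that the vote margin doubles as a difficulty estimate, so extra compute is spent only where agreement is low.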
YNU-HPCC at SemEval-2026 Task 13: Robust Machine-Generated Code Detection under Distribution Shifts
Lixian Xing | Jin Wang | Xuejie Zhang
As Large Language Models (LLMs) become prevalent in software development, distinguishing machine-generated from human-written code is increasingly important. This paper describes the system developed by the YNU-HPCC team for SemEval-2026 Task 13, which evaluates detection under cross-language, multi-generator, and hybrid settings. Three modeling paradigms are systematically examined: encoder-based fine-tuning, feature-based machine learning, and task-specific robustness strategies. For Subtask A (Binary Detection), frozen pre-trained encoders and shallow stylometric features exhibit stronger cross-domain robustness than full fine-tuning, with indentation entropy identified as a key discriminative signal. For Subtask B (Multi-Class Attribution), a hierarchical two-stage framework is adopted to decouple human–machine discrimination from fine-grained generator attribution, alleviating severe class imbalance. For Subtask C (Hybrid Detection), a token-level splicing augmentation strategy combined with Supervised Contrastive Learning and Focal Loss is employed to model intra-sample stylistic variation. According to the official leaderboard, our system ranked 12th out of 81 teams in Subtask A, 14th out of 34 in Subtask B, and 8th out of 32 in Subtask C.
TeleAI at SemEval-2026 Task 6: A Confidence-Aware Multi-Stage Reasoning Framework with Chain-of-Thought
Lingling Shi | Haoyu Jin | Shiquan Wang | Fang Yu | Shuangyong Song | Xuelong Li
This paper describes our framework for SemEval-2026 Task 6 (CLARITY - Unmasking Political Question Evasions), which focuses on classifying clarity and fine-grained evasion types in political question-answering dialogues. We propose CAMSR-CoT, a confidence-aware multi-stage reasoning framework that unifies the two subtasks through hierarchical label modeling. The framework adopts a confidence-based routing strategy: high-certainty cases are directly resolved, while ambiguous samples are routed to deeper Chain-of-Thought reasoning stages with boundary-aware few-shot exemplars to mitigate label confusion. On the development set, our framework achieves Macro-F1 scores of 0.812 on SubTask 1 and 0.617 on SubTask 2. On the official hidden test set, it ranks 1st in both SubTask 1 (Macro-F1 = 0.89) and SubTask 2 (Macro-F1 = 0.68).
chengtang at SemEval-2026 Task 7: A Retrieval-Augmented Generation Framework for Cultural Perspective Alignment in Everyday MCQs
Cheng Tang | Zhichao Meng | Meizhi Jin
Large language models (LLMs) often exhibit significant cultural representation biases in multilingual everyday knowledge understanding, struggling to accurately capture region-specific customs and values. This paper presents our system submission for SemEval-2026 Task 7: BLEnD Challenge Track 2 (MCQ) (SemEval-2026 Task 7 Organizers, 2026). To address these challenges, we propose a training-free retrieval-augmented generation (RAG) framework. Without introducing any external data, we manually constructed a localized multicultural knowledge base for each language-region and used text-embedding-v4 for region-specific cultural background retrieval. In the generation stage, we adopted a strict zero-shot setting: prompts contain no task instance question-answer examples, only injecting locale-relevant background cultural descriptions via RAG to compensate for contextual information absence, combined with a dual-model ensemble strategy using Gemini 3 Flash (preview) (Google DeepMind, 2025) and GPT-5.2 Chat (OpenAI, 2025). Our system achieved an overall score of 96.35 on the final Evaluation dataset. Additionally, we conducted in-depth analysis of model performance on specific languages, particularly highlighting severe cultural alignment challenges faced by large models in dialectal variants like Moroccan Arabic (ar-MA) and highly localized subjective Japanese (ja-JP) everyday scenarios.
Phatthachdau at SemEval-2026 Task 9: A Multi-Stage Augment-Judge-Train Pipeline for Multilingual Online Polarization Detection
Phan Phat
We address the extreme label imbalance in the Hausa dataset, where only 11% of instances are polarized, through the Augment-Judge-Train (AJT) pipeline. By leveraging Gemini 2.0 for taxonomy-driven data generation and an LLM-as-a-Judge layer for quality control, we expanded the minority class sixfold. Our ensemble architecture, combining specialized encoders with LLM-LoRA, achieved 1st place in Hausa (0.8336 Macro-F1) and ranked in the top 10 for English. These results demonstrate the efficacy of culture-aware synthetic data in enhancing social NLP for low-resource languages.
CYUT at SemEval-2026 Task 9: Monolingual vs. Multilingual LoRA Tuning for Multicultural and Multievent Polarization Detection
Shih-Hung Wu | Yun-Kuang Liao | Shih-Siang Su | Yi-Min Jian
This study addresses SemEval-2026 Task 9 on Detecting Multilingual, Multicultural, and Multievent Online Polarization, exploring the performance differences between monolingual and multilingual LoRA (Low-Rank Adaptation) fine-tuning techniques when processing online polarization phenomena. The research points out that online polarization is not only a language phenomenon, but a complex social language problem highly influenced by cultural contexts and event backgrounds. To address the limitation of existing research that only treats polarization as a binary classification, this study participates in three levels of subtasks: Subtask 1: Polarization Detection, Subtask 2: Polarization Type Classification (e.g., politics, religion), and Subtask 3: Manifestation Identification (analyzing rhetorical strategies that construct polarization, such as stereotypes and dehumanization narratives). This study aims to establish a more contextually grounded and diagnostic model analysis framework to enhance the model’s generalization ability and fairness in cross-lingual environments. By exploring different fine-tuning configurations to build a robust ensemble system, the experimental results show that our approach demonstrates exceptional proficiency in the Chinese domain, securing the 1st place ranking in Subtask 1 (Polarization Detection) for Chinese. Furthermore, we observe that while the monolingual LoRA strategy exhibits strong performance in specific languages like Chinese, integrating it with multilingual LoRA models via ensembling provides the diverse features crucial for identifying complex cross-cultural rhetoric.
DeepSemantics at SemEval-2026 Task 9: Label-Wise Optimization with Adaptive Focal Loss for Polarization Manifestation Identification
Eliasse Tiao | Josue Edou | Mahugnon Gohouede
In this paper, we present our system for SemEval-2026 Task 9, which focuses on the fine-grained identification of polarization manifestations in multilingual social media content. Our approach combines transformer-based encoders (RoBERTa-base for English and Afro-XLM-R-small for Hausa) within a One-vs-Rest (OvR) framework, complemented by controlled oversampling, Adaptive Focal Loss, and label-wise threshold optimization. To mitigate severe class imbalance and label sparsity, we adopt language-specific optimization strategies supported by pairwise χ² independence analysis. Our system achieves macro-F1 scores of 0.464 in English and 0.192 in Hausa on the official test sets, ranking 5th in Hausa and 14th in English on the official leaderboard.
UTokyo Tsuruoka Lab at SemEval-2026 Task 9: Efficient Single Forward Pass Inference for Multi-Label Polarization Classification
Howard Tangkulung | Yoshimasa Tsuruoka
Detecting and interpreting polarized online content is increasingly crucial as online platforms become central to public information exchange. We present an efficient adaptation of large language models for multi-label polarization classification in SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization. Our single-forward-pass inference method outperforms baseline multi-step decoding approaches for multi-label classification by reducing error propagation while improving inference efficiency. Beyond performance and efficiency analysis, we investigate the cross-lingual transferability of the system, observing statistically significant generalization within language families, a result that offers a practical path for low-resource language adaptation. Our system ranked 1st in 8 languages for Subtask 1 and 6 languages for Subtask 2, and placed in the top 5 for 16 out of 22 languages across both subtasks. Overall, we provide a simple, effective, and efficient solution for multilingual polarization classification.
ALPS-Lab at SemEval-2026 Task 3: A Multilingual Generative LLM Approach for Dimensional Aspect Sentiment Analysis
Songqian Dai | Wei Lin
We propose a supervised fine-tuning (SFT) approach for the DimABSA shared task, which predicts aspect-level sentiment intensities using large language models. The approach uses Gemma-3 27B with QLoRA for efficient fine-tuning on multilingual datasets. Merging data across languages improves performance, especially in low-resource domains. Post-processing removes duplicate outputs for accurate evaluation.
XiaoM at SemEval-2026 Task 7: A Qwen-based System for Accurate Retrieval of Everyday Knowledge Across Diverse Languages and Cultures
Xiao Yao | Liang Yang
This paper describes our system designed for SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures. We describe a practical inference system for a two-track benchmark consisting of short-answer questions (SAQ) and multiple-choice questions (MCQ). Our submission is implemented in a single script and targets competition constraints directly: strict TSV schemas, short answer limits, and reliability under batch inference. The system uses Qwen2.5-7B-Instruct with memory-aware initialization, deterministic decoding (no sampling, zero temperature), and post-processing rules that guarantee valid outputs. We further add retry-on-failure and file-write fault tolerance to reduce runtime interruptions.
MSqrd at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Syeda Samah Daniyal | Muneeba Badar | Manal Hasan | Shifa Shah | Sandesh Kumar | Abdul Samad
Online polarization, the critical division between social, political, or identity groups, often leads to hate speech and social fragmentation. Detecting polarization, especially across diverse linguistic and cultural contexts, is a critical challenge. This paper presents our submission for SemEval-2026 Task 9, which focuses on detecting multilingual, multicultural, and multievent online polarization (Naseem et al., 2025). The task is divided into three subtasks: (1) binary polarization detection, (2) multi-label classification of polarization type (e.g., political, racial, religious), and (3) multi-label identification of its manifestation (e.g., stereotype, vilification, dehumanization). For each subtask, we fine-tune BERT-based transformer models. Model configurations are described in Section 4. The results are evaluated using the macro F1 score. We achieved scores of 78.6, 55.8, and 44.6 on the development test set for Subtasks 1, 2, and 3, respectively. Overall, the results demonstrate the effectiveness of BERT-based models for multilingual polarization detection.
HU at SemEval-2026 Task 10: Psycholinguistic Conspiracy Marker Extraction and Detection
Muhammad Quddussi Kashaf | Shahmir Mustafa Chaudhry | Marium Zeeshan | Nahyan Javed | Sandesh Kumar | Abdul Samad
Modern media poses a complex challenge to verifying the credibility of information and public discourse due to the advent of conspiracy theory content. This paper presents our methodology for "SemEval-2026 Task 10: Psycholinguistic Conspiracy Marker Extraction and Detection". The task consists of two subtasks: extracting psycholinguistic markers from text using Named Entity Recognition (NER) techniques, and classifying Reddit comments as conspiratorial or non-conspiratorial. Our approach involved: (1) diverse extraction methodologies, including traditional BIO tagging schemes, the GlobalPointer framework, and the GLiNER2 architecture, (2) data augmentation and synthetic data generation via Large Language Models (LLMs), and (3) evaluating various transformer-based models, such as DistilBERT and COVID-Twitter-BERT. Our final system achieves a macro F1 score of 0.26 on Subtask 1 and 0.76 on Subtask 2.
RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation
Roman Derunets | Ivan Bondarenko | Oleg Sedukhin | Mikhail Komarov | Ivan Chernov | Mikhail Kulakov
This paper describes our first-place submission to Task B (generation with reference passages) of the SemEval-2026 Task 8 MTRAGEval shared task on multi-turn retrieval-augmented generation. We propose a heterogeneous ensemble of seven LLMs organised into two groups with distinct prompting strategies, and use a GPT-4o-mini judge to select the best candidate response for each instance. Our system ranked first among 26 teams, achieving a conditioned harmonic mean score of 0.78 and substantially outperforming the strongest organiser baseline (0.64). Ablation experiments show that diversity across model families, scales, and prompting strategies is critical: the ensemble consistently outperforms any individual model. We also include Meno-Lite-0.1, a 7B domain-adapted model with a favourable cost–performance trade-off, and present an analysis of MTRAGEval that highlights annotation limitations and directions for benchmark improvement.
AFourP at SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time from Ecological Essays
Shrika Thota | Lakshmi Priya Swaminatha Rao | Shivaanee Sk | Thirumurugan Ra | Vishal Muralidharan | Dhannya Santhakumari Madhavan
We describe our submission to SemEval-2026 Task 2 (Subtask 1), which asks systems to predict continuous Valence and Arousal scores from ecological diary texts. We fine-tune RoBERTa-base with a single linear regression head, treating each essay independently. Our system scores rcomposite of .679 (Valence) and .466 (Arousal) on the official test set, placing 4th on the Subtask 1 leaderboard.
HausaNLP at SemEval-2026 Task 7: Prompt-based Hausa Cultural Question Answering
Faisal Adam | Lukman Aliyu | Sani Aji | Abdulhamid Abubakar | Aliyu Rabiu Shuaibu
We describe HausaNLP’s submission to SemEval-2026 Task 7 Track 1 (short-answer cultural question answering). Our system is a training-free, prompt-based pipeline targeting native Hausa (ha-NG). Two design decisions distinguish it from a generic zero-shot baseline. First, we use locale-conditional prompting: ha-NG questions receive a system prompt instructing concise standard Hausa output with explicit Boko-script characters (á, â, Î, ű). Second, we use a two-model fallback pipeline: GPT-4o handles the primary pass, and Gemini 1.5 Flash retries any rows where the primary call returned an error or empty output, separating model knowledge failures from API-availability failures. On the official development leaderboard, our best run reached 36.4 accuracy. Error analysis shows that a non-trivial fraction of failures are placeholder strings caused by API errors rather than incorrect generations, and that surface-level mismatches (verbosity, orthographic variation) account for many of the remaining errors. Code, prompts, and processing scripts are released for reproducibility.
Takoyaki at SemEval-2026 Task 3: Ensembling LLM Predictions using Demonstration Retrieval for Dimensional Aspect-based Sentiment Analysis
Kosuke Yamada | Sho Takase | Ryosuke Kohita
This paper describes our system for SemEval-2026 Task 3 (DimABSA). We participate in Subtask 2 (DimASTE), which requires extracting triplets of aspect term, opinion term, and valence-arousal scores from review sentences, and Subtask 3 (DimASQP), which additionally requires aspect category classification to form quadruplets. Our proposed system consists of a multi-step pipeline: (1) retrieval-based in-context learning using BM25 to select relevant demonstrations for LLM inference, (2) agreement-based ensemble combining LLM predictions from multiple retrieval variants, and, for a subset of datasets, (3) error-pattern correction refining uncertain predictions using correction rule sets based on training data. Retrieval-based ICL and the agreement-based ensemble show consistent improvements across languages and domains. Error-pattern correction yields further improvement for the Japanese dataset. To further investigate output quality beyond automated evaluation metrics, we conducted human evaluation. The results suggest that LLM-based labeling achieves higher agreement with gold labels than human annotators, and additionally indicate a discrepancy between automated scores and practical output quality.
Team hugang11 at SemEval-2026 Task 1, Subtask A (Chinese): A CoT-SFT, Teacher-Constructed DPO, and Deterministic Post-processing Pipeline for Humor Generation
Gang Hu | Liu Yang | Jing Li
We present a system for SemEval-2026 Task 1, Subtask A (Chinese), which addresses humor generation with a three-stage pipeline combining CoT-SFT, teacher-constructed DPO, and deterministic post-processing. Built on Qwen2.5-7B-Instruct-bnb-4bit, the system achieved a live leaderboard rating of 991 and ranked in the second group. Our results suggest that robust inference-time control is as important as alignment-oriented training for humor generation.
ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs
Wicaksono M. | Joanito Lopo | Tack Hwa Wong | Muhammad Ravi Shulthan Habibi | Samuel Cahyawijaya
Large language models suffer from content effects in reasoning tasks, particularly in multilingual contexts. We introduce a novel method that reduces these biases through explicit structural abstraction that transforms syllogisms into canonical logical representations and applies deterministic parsing to determine validity. Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or activation-level interventions.
Scmhl5 at SemEval-2026 Task 3: Uncertainty-Aware Adversarial Learning for Embedding Enhancement in Dimensional Aspect-Based Sentiment Analysis
Haohuan Chen | Han Liu
This paper presents an uncertainty-aware adversarial learning framework developed for SemEval-2026 Task 3, a shared task focusing on Dimensional Aspect-Based Sentiment Analysis (ABSA). Our framework involves three key components: Uncertainty modeling, Heterogeneous Mixture-of-Experts (HMoE) architecture, and embedding-level adversarial training. Experimental results demonstrate that our framework effectively reduces the Root Mean Square Error (RMSE), thereby validating the synergistic advantages of uncertainty modeling and heterogeneous fusion strategies in fine-grained sentiment regression tasks.
Team VYN at SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis
Vishal Thenuwara | Widanalage De Mel | Nisansa De Silva
This paper describes our system for the DimABSA 2026 Shared Task (SemEval-2026 Task 3), Track A, covering all three subtasks. We develop two complementary approaches: (1) DESS (Thenuwara and de Silva, 2025), an adaptation of our span-based extraction model incorporating dual-channel GCNs and a valence–arousal (VA) regression head.
Dual-View Consistency Testing for Content-Invariant Multilingual Syllogistic Reasoning
Ishita Gupta | Dhruv Goyal | Jatin Bedi
Team 0704mis addressed the SemEval-2026 Task 11 Subtask 3 by building a neuro-symbolic system designed for multilingual syllogistic validity classification across 12 typologically diverse languages. The process involves a neural parser that extracts logical forms from text, which are then validated by a symbolic verifier implementing the full set of 24 valid Aristotelian forms via a hash lookup. Our standout contribution is the dual-view consistency test: the system compares a "native" parse of the original text with a "masked" version where content terms are replaced by abstract symbols (X, Y, Z), only proceeding with high confidence if both views agree. By comparing how the model interprets the same logic in two different formats, the system can detect if the model’s reasoning changes when the context shifts from real-world objects to abstract symbols. The primary goal is to combat belief bias, the human-like tendency of LLMs to accept invalid arguments if the conclusion sounds true, or reject valid arguments if the conclusion sounds false. By enforcing this dual-view check, we found that symbol abstraction (View B) acts as a structural regularizer, forcing the model to ignore semantic interference and focus on the relationship between terms.
Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting, Hybrid Search, and Cross-Encoder Reranking
David Caraman | Gheorghe Cosmin Silaghi
We describe our system for SemEval-2026 Task 8 (MTRAGEval), participating in Task A (Retrieval) across four English-language domains. Our approach employs a three-stage pipeline: (1) query rewriting via a LoRA-finetuned Qwen 2.5 7B model that transforms context-dependent follow-up questions into standalone queries, (2) hybrid BM25 and dense retrieval combined through Reciprocal Rank Fusion, and (3) cross-encoder reranking with BGE-reranker-v2-m3. On the official test set, the system achieves nDCG@5 of 0.531, ranking 8th out of 38 participating systems and 10.7% above the organizer baseline. Development comparisons reveal that domain-specific temperature tuning for query generation, where technical domains benefit from deterministic decoding and general domains from controlled randomness, provides consistent gains, while more complex strategies such as domain-aware prompting and multi-query expansion degrade performance.
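The hybrid retrieval step in the abstract above can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the authors' implementation: the function name `reciprocal_rank_fusion`, the document IDs, and the smoothing constant `k=60` (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (ranks start at 1); documents are then sorted by total score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a sparse (BM25) and a dense retriever.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because the fusion uses only ranks, it needs no score normalization between the sparse and dense retrievers, which is one common reason to prefer it over weighted score averaging.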
REGLAT at SemEval-2026 Task 9: Enhancing Arabic Online Polarization Detection Using AraBERT and Synonym Replacement Augmentation
Ahmed Fetouh | Mariam Francies | Nsrin Ashraf | Hamada Nayel | Rahmath Mohammed
In this paper, we present our system, which was submitted to SemEval-2026 Task 9 (Subtask 1: Polarization Detection) and focuses on binary classification of polarized content in Arabic social media text. To address Arabic linguistic variations, we propose a single-model approach that combines fine-tuned AraBERT with synonym-based data augmentation. On the Arabic blind set, our method achieves a competitive macro F1-score of 0.831 and an accuracy of 0.833. Among the 45 participating teams, our system ranked 11th overall, with a performance gap of 0.018 macro F1 from the top-ranked team (0.8488). The results show that a fine-tuned AraBERT with synonym replacement is a strong, simple, and reproducible baseline that outperforms more complex setups in dealing with Arabic attitude polarization nuances.
RAGTUM at SemEval-2026 Task 8: Contextual Query Rewriting and Dense Retrieval for Multi-Turn RAG
Finn Wigger | Maximilian Podolsky | Merle Wilmink | Zelong Peng
This paper describes the system developed by a team for the TUM practical course Human-Centered Computing: applications in natural language processing, network science, machine learning, and AI for the SemEval MTRAG task. Our approach addresses the challenges of multi-turn retrieval-augmented generation (RAG) by combining context-aware query rewriting with a dense retrieval strategy. We employ a pipeline that cleanses noisy corpora, uses dense OpenAI embeddings via Milvus for robust retrieval, and leverages the Gemini 2.5 Flash family of models for standalone query generation and final response synthesis. Our system demonstrates the effectiveness of integrating high-precision retrieval with fact-based generation across diverse domains.
d-itlab at SemEval-2026 Task 12: Per-Option Surprisal and Multi-Stage Gating for Precision-Oriented Causal Reasoning
Yasunori Terao | Yuuki Tachioka
We describe the system submitted by d-itlab to SemEval-2026 Task 12 (Abductive Event Reasoning), which requires selecting the most plausible direct cause(s) of an observed event from candidate options grounded in reference documents. Our approach combines (i) per-option multi-stage LLM inference that evaluates each option independently with progressively stricter verification, (ii) surprisal-based features obtained by teacher-forcing candidate sentences and measuring token-level negative log-likelihood, and (iii) an XGBoost ensemble trained on these heterogeneous features to produce a precision-oriented final prediction. On the official test set, our system scored 0.91, ranking third among 116 participating teams.
AICOE-Tredence at SemEval-2026 Task 11: Mitigating Content Bias in Syllogisms via Symbolic Logic-Language Decoupling
Rakshith R | Ankush Chopra
Content bias remains a key limitation of large language models (LLMs), which often conflate formal logical validity with real-world plausibility. SemEval-2026 Task 11 examines this challenge through multilingual syllogistic reasoning, requiring models to judge validity independently of content. We propose a structure-first reasoning paradigm that abstracts natural language syllogisms into Aristotelian logical forms. By mapping arguments to mood–figure representations and classifying validity in this symbolic space, our approach removes semantic content from the reasoning process. On the private test sets of Subtasks 1 and 3, our method achieves a perfect combined score, with 100% validity accuracy and zero content bias in both English and multilingual settings using Gemini-3 Pro Preview. We also explore transferring this paradigm to smaller models via structural supervision, finding that distilled systems retain high accuracy with minimal bias. These results suggest that explicitly separating logical form from linguistic content is a promising direction for bias-resilient and cross-lingually robust reasoning in LLMs.
hermeneutichools at SemEval-2026 Task 4: Multiperspectivity as a Resource for Narrative Similarity Prediction
Max Upravitelev | Veronika Solopova | Jing Yang | Charlott Jakob | Premtim Sahitaj | Ariana Sahitaj | Vera Schmitt
Predicting narrative similarity can be understood as an inherently interpretive task: different, equally valid readings of the same text can produce divergent interpretations and thus different similarity judgments, posing a fundamental challenge for semantic evaluation benchmarks that encode a single ground truth. Rather than treating this multiperspectivity as a challenge to overcome, we propose to incorporate it in the decision making process of predictive systems. To explore this strategy, we created an ensemble of 31 LLM personas. These range from practitioners following interpretive frameworks to more intuitive, lay-style characters. Our experiments were conducted on the SemEval-2026 Task 4 dataset, where the system ranked 13th out of 47 teams and achieved an accuracy score of 0.705. Accuracy improves with ensemble size, consistent with Condorcet Jury Theorem-like dynamics under weakened independence. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting. Our error analysis reveals a consistent negative association between gender-focused interpretive vocabulary and accuracy across all persona categories, suggesting either attention to dimensions not relevant for the benchmark or valid interpretations absent from the ground truth. This finding underscores the need for evaluation frameworks that account for interpretive plurality.
SLPGFJWUInsa at SemEval-2026 Task 1: Enhancing Linguistic Creativity for English Text-Based Humor
Insa Abbas | Sadaf Abdul Rauf
For Subtask A, our main goal is to create a joke-generation system that produces humor under constrained conditions, using unusual words and news headlines as input. We trained our model on LLM-generated and human-curated augmented data, aiming to produce constrained humor and to bridge the gap between the two. We demonstrate that using parameter-efficient fine-tuning (PEFT) on high-quality pre-trained base models in conjunction with a well-crafted prompt design allows our model to produce high-quality innovative output while maintaining the desired style.
ConTexT at SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Stories through Narrative Understanding
Fakeha Faisal | Rubab Shah | Syeda Zaidi | Azkaa Nasir | Sandesh Kumar | Abdul Samad
In this paper, we report our system for SemEval-2026 Task 5, which predicts graded plausibility scores for target word senses in narrative context. We explore embedding-based similarity, transformer fine-tuning, and a three-stage curriculum combining WiC pretraining, Wasserstein distribution learning, and KL-based calibration. Our best model, DeBERTa-XLarge with curriculum training, achieves 78% accuracy within one standard deviation and a Spearman correlation of 0.70, with an overall test score of 0.74. Results show that distribution modeling better aligns with human plausibility judgments than single-score prediction.
TeleAI at SemEval-2026 Task 3: Large Language Models for Dimensional Aspect-Based Sentiment Analysis
Yan Zhou | Wangshicheng Wang | Shiquan Wang | Mengjiao Bao | Ruiyu Fang | Shuangyong Song | Yongxiang Li | Xuelong Li
This paper describes TeleAI’s system for SemEval-2026 Task 3, Track A, Subtask 1 (DimASR), which focuses on predicting continuous Valence-Arousal (VA) scores for specific aspects in text. We frame this task as an end-to-end regression problem and propose a robust framework utilizing Qwen2.5-7B as the feature extraction backbone, combined with parameter-efficient fine-tuning via LoRA. To enhance model generalization and mitigate domain shifts, we primarily leverage multilingual and multi-domain mixed training. Furthermore, our system integrates several optimization and robustness techniques to stabilize continuous score prediction, including R-Drop-style consistency regularization, embedding-level PGD adversarial training, Smooth L1 (Huber) loss, sigmoid-based output interval mapping, and post-hoc linear calibration. Our comprehensive ablations demonstrate that the combination of joint training and robustness regularizations substantially reduces the official evaluation metric, $RMSE_{VA}$. The proposed system achieves highly competitive performance across multiple language and domain settings, demonstrating the efficacy of applying lightweight LLM adaptation for dimensional aspect-based sentiment analysis.
L3IRIT at SemEval-2026 Task 4: Learning Narrative Similarity from Aligned Film Plot Summaries
Ahmed Hamdi | Emanuela Boros | Jose G. Moreno | Adam Jatowt | Georgeta Bordea | Carlos-Emiliano González-Gallardo | Antoine Doucet
This paper presents the participation of the L3IRIT team in SemEval Task 4. The team is a joint research group working on narrative extraction from historical text, led by the IRIT laboratory (University of Toulouse) and the L3i laboratory (University of La Rochelle). Our participation is grounded in the construction of a novel bilingual resource extracted from Wikipedia by automatically aligning film plots. Leveraging this dataset, we train embedding models using contrastive learning objectives to capture higher-level narrative structures more effectively. The resulting resource goes beyond surface-level lexical overlap, providing supervision for narrative similarity without manual annotation. In addition, we introduce a named-entity masking strategy designed to promote narrative abstraction and reduce superficial entity-based matching. Overall, our approach aims to support representation learning that captures structural and event-level similarities across stories in different languages more effectively. Our system ranked in 24 of the 44 scoreboards for Task A and 20 of the 27 scoreboards for Task B, achieving accuracies of 65.75 and 61.00, respectively.
YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling
Fengze Guo | Yue Chang
We present a multilingual system for SemEval-2026 Task 9 on detecting and characterizing online polarization across languages, cultures, and events. Our approach participates in all three subtasks and models each subtask independently using a heterogeneous weighted ensemble of XLM-RoBERTa-large and mDeBERTa-v3-base. For the multi-label settings, we adopt weighted binary cross-entropy to mitigate severe label imbalance. The system is trained exclusively on the provided task data and achieves robust performance across languages.
ChulaNLP at SemEval-2026 Task 6: A Hybrid BERT-LLM Framework for Political Response Clarity and Evasion Detection
Wisarut Peerachaidecho | Attapol Rutherford
SemEval-2026 Task 6 (CLARITY: Unmasking Political Interview) focuses on detecting equivocation and evasion techniques in political interviews. While encoder-only models and Large Language Models (LLMs) individually struggle with this task, we propose a hybrid BERT–LLM framework to leverage their complementary strengths: the discriminative efficiency of fine-tuned encoders and the sophisticated reasoning of LLMs. We benchmarked several long-context architectures—DeBERTa, RooseBERT, and BigBird—finding that a truncated DeBERTa-large provided the most reliable candidates for the LLM. By using DeBERTa’s top-5 predicted labels as constrained options for LLM inference, we significantly improved evasion-level classification. This hybrid approach achieved competitive rankings in the shared task, placing 7th in Subtask 1 and 2nd in Subtask 2.
UTRAG at SemEval-2026 Task 8: History-Aware Query Rewriting and LoRA-Finetuned Generation for Multi-Turn RAG
Ke Zhou | Yi-Shan Lin
This paper describes our system for SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations (MTRAGEval), which evaluates retrieval-augmented generation (RAG) in multi-turn, context-dependent settings. We improve retrieval with history-aware query rewriting and enhance generation faithfulness with a LoRA-adapted model, integrating both into an end-to-end pipeline. Our approach achieves competitive performance across all subtasks, with nDCG@5 of 0.4855 in Subtask A, a harmonic mean score of 0.6554 in Subtask B, and 0.5159 in Subtask C, outperforming strong baselines in Subtasks A and B while remaining competitive in Subtask C. Our analysis shows that increasing dialogue length introduces cumulative errors in history selection and query formulation, leading to incomplete or drifting retrieval results and increasing the risk of hallucination.
CLaC at SemEval-2026 Task 6: Response Clarity Detection in Political Discourse
Nawar Turk | Lucas Miquet-Westphal | Leila Kosseim
In this paper, we present our system for SemEval-2026 Task 6 (CLARITY) on response clarity and evasion detection in question-answer pairs from U.S. presidential interviews, comparing fine-tuned encoders with prompt-based LLMs. Our LLM ensemble achieves 80 macro-F1 on the 3-class Task 1 (9th/41) and 59 on the 9-class Task 2 (3rd/33). Across 8 transformer encoders optimized through a four-stage pipeline, partial encoder layer unfreezing outperforms full fine-tuning by a wide margin. Combining English and multilingual encoders further improves ensemble performance over either family alone, despite multilingual models being individually weaker. Prompt-based LLMs, without any task-specific parameter updates, outperform fine-tuned encoders, particularly on minority classes; among open-weight LLMs, parameter count does not predict performance. Enriched input, concatenating the full interviewer turn, improves LLM performance but not that of encoders, an effect that persists with Longformer’s extended context window, suggesting the divergence is not attributable to sequence-length capacity alone in our settings. The Clear Reply/Ambivalent boundary remains the dominant failure mode, mirroring the disagreement among human annotators. Our code, prompts, model configurations, and results are publicly available.
Semantic Vectors at SemEval-2026 Task 9: Robust Multilingual Polarization Detection via Dual-Encoder Fusion and Expert Ensembling
Ankit Dash | Priyanshu Mittal | Piyush Prashant | Sunil Saumya
We present SEMANTIC VECTORS, our system for POLAR@SemEval-2026 Task 9 on multilingual online polarization detection across 22 typologically diverse languages. Polarization is frequently conveyed through implicit rhetorical framing, making cross-lingual detection highly challenging. We address this with a Siamese dual-encoder jointly fine-tuning mDeBERTa-v3-base and XLM-RoBERTa-large via 4-bit QLoRA, fused with language-specific expert models (GBERT, Italian BERT, Swahili BERT) through an XGBoost meta-stacker with per-language Platt calibration. Rather than addressing class imbalance, focal loss functions as a hard-example miner, concentrating gradients on subtly framed instances rather than lexically obvious ones. Combined with per-language threshold optimization, our system achieves macro-F1=0.797 and accuracy=0.827 across all 22 languages.
NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression
Tong Wu | Nicolay Rusnachenko | Huizhi(elly) Liang
Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence–arousal (VA) regression. This paper describes a system developed for Track A, Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, using dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language–domain pair (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models under a few-shot prompting setting, demonstrating that task-specific fine-tuning outperforms these LLM-based methods across all evaluation datasets.
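The sigmoid-scaled output mentioned in the abstract above can be sketched as a mapping from an unbounded regression output to the [1, 9] valence-arousal range. This is an illustrative assumption about how such a scaling is typically written, not the authors' code; the function name `scale_to_range` and its parameters are hypothetical.

```python
import math


def scale_to_range(raw_output, low=1.0, high=9.0):
    """Squash an unbounded regression output into the open interval (low, high).

    The sigmoid maps raw_output to (0, 1); the affine step stretches that
    interval to (low, high), so predictions can never leave the label range.
    """
    sigmoid = 1.0 / (1.0 + math.exp(-raw_output))
    return low + (high - low) * sigmoid


# A raw output of 0 lands exactly at the midpoint of the range.
midpoint = scale_to_range(0.0)
```

Applying the same function separately to the valence head and the arousal head gives two bounded predictions per aspect.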
SokraTUM at SemEval-2026 Task 3: A hybrid cascade of Label Distribution Learning, RAG supported generative extraction and contrastive metric learning for dimensional sentiment analysis
Denis Laschenko | Albert Korotyk
The Dimensional ABSA (DimABSA) shared task extends traditional aspect-based sentiment analysis from categorical polarity to continuous valence–arousal (VA) prediction. We present our system for all three subtasks: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quad Prediction (DimASQP). Due to the cascading nature of the different subtasks, we built a modular interlocking pipeline that uses classical Machine Learning and NLP methods. Experiments across domains show consistent gains in regression accuracy and structured extraction performance. Our results demonstrate the effectiveness of distribution-aware regression, retrieval-augmented generation, and contrastive prototype learning for dimensional sentiment analysis.
NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating
Tong Wu | Thanet Markchom | Huizhi(elly) Liang
Word sense plausibility rating requires predicting the human-perceived plausibility of a given word sense on a 1–5 scale in the context of short narrative stories containing ambiguous homonyms. This paper systematically compares three approaches: (1) embedding-based methods pairing sentence embeddings with standard regressors, (2) transformer fine-tuning with parameter-efficient adaptation, and (3) large language model (LLM) prompting with structured reasoning and explicit decision rules. The best-performing system employs a structured prompting strategy that decomposes evaluation into narrative components (precontext, target sentence, ending) and applies explicit decision rules for rating calibration. The analysis reveals that structured prompting with decision rules outperforms both fine-tuned models and embedding-based approaches, and that prompt design matters more than model scale for this task.
RAGonauts at SemEval-2026 Task 8: BM25 Retrieval with Last-Turn Query Formulation for Multi-Turn RAG Conversations
Rajalakshmi Sivanaiah | Angel Deborah S | Karthik Raja C | Rithika S
This paper describes the submission to Task A of SemEval-2026 Task 8: MTRAGEval, which evaluates passage retrieval for multi-turn Retrieval-Augmented Generation (RAG) conversations across multiple knowledge domains. The task requires retrieving relevant supporting passages given conversational history, where user queries often contain implicit references and incomplete contextual information. This paper proposes a lightweight and training-free retrieval framework based on BM25 ranking combined with conversational query formulation. Queries are derived from dialogue turns and retrieval is performed using domain-specific indices to preserve corpus relevance. Without neural retrievers or fine-tuning, our system achieves an nDCG@5 score of 0.2836 on the official evaluation set, ranking 33rd on the leaderboard. This result demonstrates that sparse lexical retrieval remains an efficient and reproducible baseline for conversational RAG systems.
0704mis at SemEval-2026 Task 11: Single-Call Joint Abstraction for Robust Neuro-Symbolic Retrieval
Ishita Gupta | Dhruv Goyal | Jatin Bedi
We present our submission to SemEval-2026 Task 11 Subtasks 2 and 4, on syllogistic premise retrieval with distractors. Our system is based on a robustness-first neuro-symbolic pipeline. The key innovation is single-call joint abstraction: rather than parsing all statements independently, one LLM call jointly abstracts all premises and the conclusion into categorical logical forms (A/E/I/O) where symbolic (X/Y/Z) mappings are globally consistent. This allows reliable detection of the shared middle term needed for syllogistic validation. Parsed forms are passed through an exhaustive O(n²) premise-pair search with deterministic validation against the 24 valid Aristotelian syllogistic forms via constant time lookup. Ablation studies show that more theoretically sophisticated variants degrade performance when logical-form extraction is the primary bottleneck. Our approach achieves competitive rankings in both English and multilingual settings while remaining simple, deterministic, and content-invariant.
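The premise-pair search with table lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: statements are assumed to be already parsed into hypothetical `(type, subject, predicate)` tuples with type in A/E/I/O, the helper names are invented here, and the table lists the 24 traditionally valid mood-figure combinations (including those that rely on existential import).

```python
from itertools import permutations

# The 24 traditionally valid moods, keyed by figure (1-4).
VALID_FORMS = {
    1: {"AAA", "EAE", "AII", "EIO", "AAI", "EAO"},
    2: {"EAE", "AEE", "EIO", "AOO", "EAO", "AEO"},
    3: {"AAI", "IAI", "AII", "EAO", "OAO", "EIO"},
    4: {"AAI", "AEE", "IAI", "EAO", "EIO", "AEO"},
}


def figure(major, minor, conclusion):
    """Return the figure (1-4) of a candidate syllogism, or None."""
    _, s, p = conclusion
    _, maj_subj, maj_pred = major
    _, min_subj, min_pred = minor
    candidates = [
        (1, maj_pred == p and min_subj == s, maj_subj, min_pred),  # M-P, S-M
        (2, maj_subj == p and min_subj == s, maj_pred, min_pred),  # P-M, S-M
        (3, maj_pred == p and min_pred == s, maj_subj, min_subj),  # M-P, M-S
        (4, maj_subj == p and min_pred == s, maj_pred, min_subj),  # P-M, M-S
    ]
    for fig, ends_match, middle_a, middle_b in candidates:
        # Both premises must share one middle term distinct from S and P.
        if ends_match and middle_a == middle_b and middle_a not in (s, p):
            return fig
    return None


def find_valid_pair(premises, conclusion):
    """Search all ordered premise pairs; return indices of the first valid one."""
    for i, j in permutations(range(len(premises)), 2):
        fig = figure(premises[i], premises[j], conclusion)
        if fig is None:
            continue
        mood = premises[i][0] + premises[j][0] + conclusion[0]
        if mood in VALID_FORMS[fig]:  # constant-time set lookup
            return (i, j)
    return None


# Hypothetical parsed input: two relevant premises and one distractor.
premises = [("A", "Y", "Z"), ("I", "W", "Q"), ("A", "X", "Y")]
conclusion = ("A", "X", "Z")
pair = find_valid_pair(premises, conclusion)
```

The search is quadratic in the number of candidate premises, while each validity check is a set membership test, which matches the cost profile the abstract describes.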
CiNet-Handai-Kyodai at SemEval-2026 Task 5: Combining LLM Prompting, Semantic Similarity, and Synthetic Gaze for Graded Sense Plausibility
Lis Kanashiro Pereira | Fei Cheng
We present a hybrid system for SemEval-2026 Task 5 on graded word-sense plausibility in narrative contexts. Our approach combines prompt-based large language model (LLM) scoring with three complementary features: semantic embedding similarity, story-conditioned definition generation, and a synthetic gaze signal based on predicted fixation time. We combine these signals using an ordinary least squares regressor. On the official test set, our best system achieves 90.10 Acc±SD and 79.19 Spearman correlation. The system surpasses the reported human reference score on Acc±SD, highlighting the value of combining LLM-based judgments with targeted linguistic and cognitive-inspired features.
SRCB at SemEval-2026 Task 5: A Multi-Target Finetuning Framework for Large Language Models with Joint Regression and Text Generation
Yuming Zhang | Junyu Zhou | Hongyu Li | Yongwei Zhang | Shanshan Jiang | Bin Dong
This paper presents our winning system for SemEval-2026 Task 5 on rating the plausibility of word senses in ambiguous stories. Unlike traditional Word Sense Disambiguation, the task requires predicting continuous plausibility scores that reflect human variability rather than selecting a single correct sense. We propose a multi-target fine-tuning framework for decoder-only large language models that jointly optimizes regression for score prediction and text generation for interpretable explanations. To address inter-annotator variability, we adopt objective-level strategies to enhance robustness. Our system achieves first place, demonstrating the effectiveness of unified regressive–generative modeling for fine-grained plausibility estimation.
ICT-NLP at SemEval-2026 Task 1: Humor Generation via RAG-based Augmentation and Multi-LLM Internal-External Voting
Wutao Shen | Liyuan Huang | Jiawei He | Lin Li | Jin Zhang
This paper presents the system we developed for SemEval-2026 Task 1: Humor Generation. The task focuses on developing systems capable of generating genuinely humorous content under various constraints. In this work, we propose using a Retrieval-Augmented Generation approach to preprocess news headlines and obtain summaries of news content. Furthermore, we employ a unified humor generation mode to adapt to the two types of generation constraints. Finally, we conduct an internal-external voting process to produce the final optimal joke output. Our approach achieves competitive performance in this task: it ranks 1st (tied) among all participating teams in the Chinese track of Subtask A.
ThinkVision at SemEval-2026 Task 6: A Transformer-Based Ensemble System for Clarity Detection
Purohit Ghanshyam | Praveen Swami | Shriyans Sahoo | Jenish Bhati | Supriya Nadiger | Sunil Saumya
We study the problem of assessing the clarity of political question–answer pairs, where the goal is to determine whether a response directly addresses the question, avoids it, or remains ambiguous. This task is particularly challenging in political discourse, where evasiveness can be subtle and context-dependent. To tackle this problem, we propose an ensemble-based approach built on the transformer-based model DeBERTa-v3-base, fine-tuned on concatenated question–answer inputs. Special attention is given to class imbalance during training to ensure robust performance across all categories. To better capture uncertainty in borderline cases, we train multiple models with different random seeds and employ Monte Carlo Dropout at inference time. Final predictions are obtained by averaging logits across ensemble models and stochastic forward passes, yielding more stable and robust predictions. Our system achieves a Macro-F1 score of 0.76 on the evaluation dataset. Error analysis reveals that responses that partially engage with the question while failing to provide a direct answer remain the most challenging, highlighting the inherent difficulty of detecting nuanced evasiveness in political communication.
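The final aggregation step described above, averaging logits over ensemble members and stochastic forward passes, can be sketched in a few lines. This is only an illustration under assumptions: the logit vectors are made-up numbers standing in for model outputs, and the function names and the three label strings are hypothetical.

```python
from statistics import fmean


def average_logits(logit_runs):
    """Average class logits over all runs.

    logit_runs holds one logit vector per (model seed, stochastic pass)
    combination, flattened into a single list.
    """
    n_classes = len(logit_runs[0])
    return [fmean(run[c] for run in logit_runs) for c in range(n_classes)]


def predict(logit_runs, labels):
    """Return the label whose averaged logit is largest."""
    mean = average_logits(logit_runs)
    best = max(range(len(mean)), key=mean.__getitem__)
    return labels[best]


# Hypothetical labels and logits from three runs on one question-answer pair.
labels = ["direct", "evasive", "ambiguous"]
runs = [[2.0, 1.0, 0.0], [0.0, 3.0, 0.0], [1.0, 2.0, 0.0]]
prediction = predict(runs, labels)
```

Averaging before the argmax lets a single confident but atypical run be outvoted by the others, which is the stabilizing effect the abstract attributes to the ensemble.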
Team YTY at SemEval 2026 task 12: Option-Aware Retrieval and Cross-Encoder Reasoning Framework for Abductive Event Reasoning
Junxin Lin | Zhichao Meng | Lianxin Jiang
We describe a unified system for SemEval-2026 Task 9 on multilingual polarization detection. The task requires binary polarization detection, multi-label target type classification, and multi-label manifestation identification across languages and events with severe class imbalance. Our approach combines (i) targeted data augmentation for low-frequency labels, (ii) merged multitask fine-tuning of Subtask 2 and Subtask 3, and (iii) model fusion to improve cross-lingual stability. Subtask 1 predictions are derived via calibrated inference from the multi-label head. On the development set, multitask training consistently out-performs single-task variants, and fusion yields additional gains, especially for rare labels. We also report ablations and error analyses, highlighting remaining challenges such as implicit polarization and partial-label uncertainty.
This article presents our study on task 10: Psycholinguistic conspiracy marker extraction and detection, which includes token-level extraction tasks and sentence-level conspiracy detection tasks. Focusing on conspiracy theory texts on social media, this paper proposes a classification method that combines semantic encoding with large language model reasoning and generation. Semantic features are extracted using DeBERTa-v3, and explanatory reasoning text is generated through ConspEmoLLM-v2. The two are then combined for classification, thereby enhancing the model’s ability to recognize implicit conspiratorial logic. For the extraction subtask, this paper provides systematic comparison results of several mainstream pre-trained models, mainly conducting baseline model comparisons and performance analysis.
SCUZANE at SemEval-2026 Task 3: Dimensional Aspect-based Sentiment Analysis with Supervised Contrastive Regression and R-Drop Regularization
Ziang Zhou | Xiangmei He | Chenhongyi Bai
Current Aspect-Based Sentiment Analysis (ABSA) often relies on coarse-grained categorical labels, such as Positive and Negative, which often fail to capture the subtle intensity of emotional expression in real-world text. To address this issue, the SemEval-2026 Shared Task 3: Dimensional ABSA (DimABSA) extends traditional ABSA by replacing categorical sentiment polarity with continuous valence-arousal (VA) scores. In this paper, we describe our system for Subtask 1 (Dimensional Aspect Sentiment Regression) of Track A (DimABSA). Our system utilizes a DeBERTa-v3-large backbone, enhanced by a prompt-based learning strategy that concatenates aspect information with the context. We also employ multi-sample dropout and a weighted aggregation of the hidden states from the last four layers to capture rich semantic representations. Our experimental results across all provided domains and languages demonstrate the effectiveness of integrating consistency regularization with dimensional contrastive learning for fine-grained sentiment regression.
AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning
Nikolaos Karafyllis | Maria Lymperaiou | Giorgos Filandrianos | Athanasios Voulodimos | Giorgos Stamou
We present a winning three-stage system for SemEval 2026 Task 12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive reasoning with prompt design informed by reflective prompt evolution, and post-hoc consistency enforcement; our system ranks first on the evaluation-phase leaderboard with an accuracy score of 0.95. Cross-model error analysis across 14 models (7 families) reveals three shared inductive biases: causal chain incompleteness, proximate cause preference, and salience bias, whose cross-family convergence (51% cause-count reduction) indicates systematic rather than model-specific failure modes in multi-label causal reasoning.
SilkPeak at SemEval-2026 Task 6: When Politicians Dodge — Unmasking Evasion in Political Interviews through Joint Multi-Task Transformer Learning
Amruth Tetakali | Lavanya Tetakali
This paper describes a system for SemEval-2026 Task 6 (CLARITY), which focuses on recognizing evasive communication in political interviews. The approach treats the one subtask—determining the clarity level of an answer —as a single joint multi-task problem. A DeBERTa-v3-Large encoder is shared across both tasks, processing the question and answer as a single concatenated sequence. By updating independent linear classification heads simultaneously, the model allows the fine-grained learning signals from the evasion taxonomy to directly inform the broader clarity-level decisions, and vice versa. On the official evaluation set, this joint discriminative system achieves a 0.76 macro F1 score on Task 1. This approach significantly outperforms standard single-task baseline models, hierarchical bi-encoding architectures, and generative large language models like LoRA-tuned LLaMA-3-8B.
5ting at SemEval-2026 Task 8: Strong End-to-End Multi-Turn RAG via LLM-Based Reranking and Faithfulness Control
Thien-Qua T-Nguyen | Chi Hoang | Nguyen Tran | Tri Le | Khanh Truong | Chinh Nguyen
This paper presents a modular multi-turn Retrieval-Augmented Generation (RAG) system designed to mitigate hallucination, context drift, and underspecification. The pipeline combines dual-query merged retrieval and LLM-based reranking to deliver high-precision evidence, improving nDCG@5 by 17.7%. To strictly control hallucination during generation, we introduce a role-separated prompting strategy. This approach explicitly isolates the conversation history (used solely for intent and coreference resolution) from the retrieved passages (enforced as the exclusive source of factual grounding). By preventing the language model from misinterpreting prior dialogue turns as factual evidence, the system ranked 3/29 in the SemEval-2026 Task 8 end-to-end evaluation. Notably, our faithfulness-oriented design achieved a high ROUGE-L F1 score of 0.7692, outperforming larger baselines and demonstrating that explicit grounding constraints are highly effective at ensuring lexical faithfulness and reducing hallucinations.
TeamOmega at SemEval-2026 Task 13: Frozen vs. Trainable Representations for Out-of-Distribution AI-Generated Code Detection: A CodeBERT Fine-Tuning Study
Nahid Niyaz Shovon | Md. Naim Parvez
We propose a CodeBERT-based system for detecting AI-generated code under severe cross-language and cross-domain distribution shift. Our approach conducts a controlled comparison between a fully frozen backbone and a partially fine-tuned configuration that unfreezes only the final transformer layer with discriminative learning rates. While partial fine-tuning substantially improves in-domain performance, the frozen backbone demonstrates stronger robustness under out-of-distribution evaluation. Our results highlight a trade-off between task adaptation and cross-language generalization in machine-generated code detection.
Pixel Phantoms at SemEval-2026 Task 13: Exploring Classical and Neural Approaches for AI-Generated Code Detection
Jithu Morrison S | Janani Hariharakrishnan | Angel Deborah S | Rajalakshmi S
This paper describes our system for SemEval-2026 Task 13, Subtask A: detecting whether a given code snippet is AI-generated or human-written. We explored a range of approaches from classical machine learning baselines using TF-IDF representations to fine-tuned transformer models pre-trained on code, specifically CodeBERT and GraphCodeBERT. Our experiments revealed a notable degradation in model performance when CodeBERT was trained beyond an optimal number of steps, indicating that continued training within an epoch leads to overfitting or representation drift. GraphCodeBERT, by contrast, yielded our best submission with a macro F1 score of 0.36866. Our findings highlight the sensitivity of code-specific transformers to training duration and suggest that early checkpoint selection is critical for this task.
schmerle at SemEval-2026 Task 4: Exploring Large Language Model Prompting Strategies for Low-Resource Narrative Similarity Detection
Maximilian Schmerle | Nils Constantin Hellwig
Narrative similarity detection has broad applications in plagiarism detection, content recommendation, and comparative narrative analysis. We present a training-free, prompting-only framework for SemEval-2026 Task 4 (Track A), which requires identifying which of two candidate stories is narratively more similar to a given anchor story. Without any fine-tuning or additional annotations, we systematically evaluate three prompt templates across five structural prompting strategies, including zero-shot and few-shot inference, narrative summarization, keyword extraction, aspect splitting, and pairwise comparison. Structured prompt templates and decomposed pairwise comparisons consistently outperform baseline configurations, achieving a peak accuracy of 72.50% on the test set and 67.75% on the final leaderboard (23rd out of 44 teams).
Team UBSE at SemEval-2026 Task 4: Adapting Generalist Embeddings for Narrative Representations
Marius Marogel | Marius Popescu
The Narrative Story Similarity and Narrative Representation Learning (NSNRL) task measures the narrative similarity between two stories based on three core aspects: the abstract theme, the course of action, and the outcomes. Our system leverages LLMs to extract high-level aspects, which are then encoded with state-of-the-art generalist embedding models. We then apply a series of embedding post-processing steps and learn to fit the embedding space with a Mahalanobis-like diagonal metric. We show that some of these techniques should not be applied universally: depending on the base encoder, they may fail to improve performance or may lead to overfitting. Our system outperforms the baseline only in Track B, ranking twelfth out of twenty-seven on the final leaderboard, while falling below the baseline accuracy in Track A.
This paper presents our solution for Subtask 2, which focuses on the automated detection of conspiracy in text. Unlike traditional toxic text detection, conspiracy identification is particularly challenging as it often lacks explicit semantic indicators. To address this, we leveraged a Large Language Model (LLM) as our backbone and employed Low-Rank Adaptation (LoRA) for fine-tuning to enhance detection performance. To generate probabilistic confidence scores while maintaining the generative capabilities of the LLM, we implemented a hybrid loss function that integrates both generative and token classification losses. Additionally, semi-supervised learning with unlabeled data was incorporated to further refine classification accuracy. Our approach achieved a test accuracy of 0.87, ranking 2nd in both stages of the competition leaderboard.
Taien at SemEval-2026 Task 9: Multilingual Polarization Detection Using Transformer-based Models
Saida Taien | Palash Hossen
This submission describes a multilingual polarization detection system for SemEval-2026 Task 9. The system leverages parallel fine-tuning of XLM-RoBERTa and mDeBERTa-v3 transformer models with a probability-level ensemble to improve prediction reliability. We employ language-independent preprocessing, subword tokenization, and a standardized classification head for all 22 languages to ensure a consistent modeling framework across the multilingual setting. Experimental results demonstrate strong performance on both high-resource and low-resource languages, highlighting the effectiveness of the ensemble approach in stabilizing predictions and improving multilingual polarization detection.
Clutch or Cry at SemEval-2026 Task 12: Offline Retrieval-Augmented Generation with Frozen DeBERTa for Abductive Event Reasoning
Aayush Prasad | Rudra Trivedi | Arshad Khatib | Shrikant Malviya | Naveen Kumar
We present our system for SemEval-2026 Task 12 on abductive event reasoning. Initial experiments with direct fine-tuning of large language models suffered from severe overfitting due to limited training data, while smaller models failed under context-length constraints, leading to random guessing under the strict Exact Match evaluation metric. To address these challenges, we propose a two-stage offline Retrieval-Augmented Generation (RAG) pipeline that separates semantic evidence retrieval from multi-label classification. We employ a dense retriever (all-MiniLM-L6-v2) to extract the single most relevant sentence (top-k=1) and feed it into a partially frozen DeBERTa-v3-Large classifier trained with BCEWithLogitsLoss. Freezing the lower 12 layers effectively mitigates overfitting while preserving pre-trained semantic knowledge. Our approach eliminates long-context truncation issues, reduces hallucination, and achieves a final Exact Match accuracy of 0.72 on the official test set.
transformer_1376 at SemEval-2026 Task 9: A Multi-Stage Pipeline with Calibrated Ensembles and Lexical Post-Processing for Online Polarization Detection in Bengali
Shuvodwip Saha | Pritha Saha
POLAR @ SemEval-2026 Task 9 deals with the detection of online polarization in a variety of multilingual and multicultural environments. Our team participated in Subtask 1, which concerns binary classification of textual sequences for the detection of polarized stances. In this paper, we propose a strong classification system for the Bengali language based on fine-tuning the BanglaBERT Large model. Our methodology uses stratified five-fold cross-validation with a performance-weighted ensemble, combined with temperature-scaling probability calibration and a set of lexical post-processing rules.
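Two of the components named in this abstract, temperature scaling and a performance-weighted ensemble over folds, can be sketched as follows. This is an illustrative sketch under assumptions, not the team's implementation: the function names are hypothetical, the temperature would normally be fitted on held-out data, and the fold weights are assumed to be proportional to each fold's validation score.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the distribution)."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_ensemble(fold_probs, fold_scores):
    """Combine per-fold class probabilities, weighting each fold by its validation score."""
    total = sum(fold_scores)
    weights = [s / total for s in fold_scores]
    n_classes = len(fold_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, fold_probs))
        for c in range(n_classes)
    ]


# Hypothetical binary logits from two folds, calibrated with an assumed T = 2.0.
fold_probs = [softmax([2.0, 0.0], temperature=2.0), softmax([0.5, 1.0], temperature=2.0)]
print(weighted_ensemble(fold_probs, fold_scores=[0.80, 0.70]))
```

Raising the temperature lowers the confidence of the top class without changing which class is ranked first, which is why it is usually applied as a post-hoc calibration step.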
Team Yuvan at SemEval-2026 Task 13: Task-Adaptive Ensemble Strategies for AI-Generated Code Detection
Yuvan Ramesh | Tongtong Wu
We describe our system for SemEval-2026 Task 13 on detecting machine-generated code across eight programming languages and three subtasks: binary human-vs-AI detection, 11-way source identification, and 4-way generator classification. Our approach uses a task-specific combination of Qwen2.5-Coder-1.5B with LoRA fine-tuning, abstract syntax tree (AST) features, CodeBERT with head-tail chunking, and TF-IDF features. Experiments reveal three main findings. For Task A, neural detectors degrade markedly on the official test split, while AST-based structural features remain more stable, suggesting substantial distribution shift. For Task B, inverse-frequency class weighting is essential under extreme label imbalance and greatly improves macro-F1. For Task C, combining neural and statistical models performs better than relying on a single model alone, indicating complementary strengths across representations. Our final system achieves 0.638 macro-F1 on Task A, 0.449 macro-F1 on Task B, and 0.714 macro-F1 on Task C, offering practical insights into robustness, imbalance handling, and model complementarity for AI-generated code detection.
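Inverse-frequency class weighting, which this abstract reports as essential for Task B, can be sketched as below. The exact formula the authors used is not stated; this sketch assumes the common "balanced" form, weight = N / (K * count), and the label names are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by N / (K * count), so rarer classes get larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical imbalanced label list: 8 samples of one class, 2 of another.
labels = ["human"] * 8 + ["generator_a"] * 2
print(inverse_frequency_weights(labels))  # {'human': 0.625, 'generator_a': 2.5}
```

The resulting weights are typically passed to a weighted cross-entropy loss, so that errors on rare classes contribute more to the gradient.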
MedHastra at SemEval-2026 Task 13: Stylometric Ensembles and Transformer Fine-Tuning for Robust AI Code Detection, Attribution, and Adversarial Analysis
Shruti Chandrasekar | Vedajanaani R S | Vijayalakshmi P
This paper describes Team MedHastra’s submission to SemEval-2026 Task 13 on detecting machine-generated code across diverse programming languages, generators, and application scenarios. We participated in all three subtasks: (A) binary detection of AI-generated code under out-of-distribution conditions, (B) multi-class attribution across ten large language model families, and (C) classification of human, fully AI-generated, hybrid, and adversarial code. For Subtask A, we implemented a stylometric ensemble combining structural formatting features with word- and character-level TF-IDF representations, trained using Random Forest, Gradient Boosting, and Logistic Regression with soft voting. For Subtasks B and C, we fine-tuned CodeBERT to leverage contextual code representations, incorporating class balancing strategies such as downsampling and weighted cross-entropy. Our results demonstrate that handcrafted stylometric features struggle under strong distribution shift, while transformer-based contextual modeling is more effective for fine-grained attribution and hybrid/adversarial detection. The study highlights the importance of robust contextual representations for realistic AI-assisted programming scenarios.
Team Duo at SemEval-2026 Task 13: Fine-tuning CodeBERT for Out-of-Distribution AI-Generated Code Detection
Subhiksha G | Sanjai M | Rajalakshmi Sivanaiah | Angel Deborah S
This paper addresses detecting AI-generated code in out-of-distribution settings by fine-tuning CodeBERT on algorithmic code from C++, Python, and Java. While the model achieves near-perfect performance on training data (F1 = 0.9935), it degrades significantly on unseen languages and domains (F1 = 0.3532). The high recall (0.8789) but low precision (0.2210) indicates over-prediction of machine-generated code. Error analysis reveals three failure modes: domain mismatch, unfamiliar syntax patterns, and insufficient training. Multi-epoch training and domain-specific augmentation are needed to improve OOD generalization.
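The out-of-distribution figures in this abstract are mutually consistent, as a quick check shows. The sketch assumes the reported F1 is the harmonic mean of the reported precision and recall for the same class; the function name is illustrative.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the OOD setting.
print(round(f1_score(0.2210, 0.8789), 4))  # 0.3532
```

With recall near 0.88 and precision near 0.22, roughly four out of five snippets flagged as machine-generated are false alarms, which matches the over-prediction the authors describe.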
Segmentation Fault at SemEval-2026 Task 13: A Regularization-First Approach with Generator-Based Out-of-Distribution Splits for Detecting AI-Generated Code
Lakshmi Priya Swaminatha Rao | Dhannya Santhakumari Madhavan | Sreya Kodeswaran | Nithila R | Kanmani R
This paper describes our submission to SemEval-2026 Task 13 (Subtask A) on detecting AI-generated code. We fine-tune CodeBERT-base using a generator-aware out-of-distribution (OOD) validation split to better simulate unseen test generators. Strong regularization techniques, including stochastic data augmentation, dropout, weight decay, and label smoothing, are applied to prevent overfitting to generator-specific patterns. Experiments with logistic regression, UniXcoder, and vanilla CodeBERT reveal that evaluation design has a larger impact on generalization than model scale or training data volume. Our final system achieves a macro F1 score of 0.439 on the hidden test set, representing a 62% relative improvement over unregularized baselines.
TechSSN at SemEval-2026 Task 8: MTRAG Retrieval and Generation using Ensemble Re-encoders and Anchor Prompting
Anne Jacika J | Anishka K | Guruprakash K | Rajalakshmi Sivanaiah | Angel Deborah S
This paper discusses the Retrieval-Augmented Generation (RAG) system submitted to the MTRAG-UN shared task on multi-turn conversational question answering. The paper describes the proposed solution for Task A (Document Retrieval) and Task C (Full RAG Pipeline), focusing on retrieval robustness and grounded response generation in complex English multi-turn dialogs. The proposed retrieval architecture uses a cascaded hybrid pipeline, which combines sparse retrieval (BM25) with dense bi-encoder models (BGE-base-en-v1.5 and E5-base), integrated via Reciprocal Rank Fusion and refined using a weighted ensemble of cross-encoders. For the generation part, the top-3 retrieved passages are injected into FLAN-T5-Large using an anchor-prompting strategy to output grounded faithful responses. Experimental results show that the proposed hybrid retrieval framework with multi-stage reranking significantly enhances passage selection, particularly for non-standalone conversational queries. Further analysis reveals persistent difficulties in handling underspecified and unanswerable questions, as well as an increased susceptibility to retrieval noise in later dialog turns.
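Reciprocal Rank Fusion, the step that merges the sparse and dense rankings in this abstract, can be sketched as follows. The smoothing constant k = 60 is a commonly used default and an assumption here, as are the document ids and the three example rankings; the abstract does not give the team's settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists containing it."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 lists from a sparse retriever and two dense retrievers.
bm25 = ["d1", "d2", "d3"]
dense_a = ["d2", "d3", "d1"]
dense_b = ["d2", "d1", "d4"]
print(reciprocal_rank_fusion([bm25, dense_a, dense_b]))  # ['d2', 'd1', 'd3', 'd4']
```

Because RRF uses only ranks, it needs no score normalization across retrievers, which makes it a convenient first stage before a cross-encoder reranker.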
DataBees at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Tanisha Sriram | Sathvika Shankar | Sowmya Anand | Rajalakshmi Sivanaiah | Angel Deborah S | Mirnalinee Thankanadar
This paper describes our submission to SemEval-2026 Task 9, Subtask 1: Multilingual Text Classification Challenge - Polarization Detection. Our focus is on how classical and transformer-based models compare when applied to multilingual polarization detection. We aim to understand where each type tends to do well and where it breaks down, particularly once you move from high-resource to low-resource settings. Our experimental setup evaluates classical machine learning models (TF-IDF with Naive Bayes, Logistic Regression, and Linear SVM) alongside language-specific transformer models across multiple languages. For Arabic, Bengali, German, Italian, and Spanish, we leveraged both multilingual and monolingual pre-trained transformers such as mBERT, XLM-R, AraBERTv2, BanglaBERT, and BETO. We compare individual classical and transformer-based models to identify which modeling choices work best for each language. Our results varied substantially across languages. We achieved our best leaderboard rankings in Bengali (6th out of 48 teams) and Italian (6th out of 43 teams), while performance was lower in Arabic (33rd out of 44), German (41st out of 44), and Spanish (46th out of 48). The study highlights the value of comparing classical and transformer-based approaches for multilingual polarization detection and identifies language-specific challenges for future improvement.
Team Habib Disambiguators at SemEval-2026 Task 5: Assessing Semantic Plausibility using Regularized Transformer Fine-Tuning
Zohaib Aslam | Ahsan Siddiqui | Ayesha Enayet
This paper presents a system for SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Sentences through Narrative Understanding. The task involves predicting the plausibility of a specific word sense within a short story where context provided by the ending resolves a deliberate ambiguity. We model this as a regression problem, fine-tuning a DeBERTa-v3 transformer to predict the distribution of human judgments rather than a single hard label. To address the challenge of limited training data and potential overfitting, we employ R-Drop (Consistency Regularization) to enforce prediction stability across dropout masks and Layer-wise Learning Rate Decay (LLRD) to preserve the model’s pre-trained linguistic knowledge. Our experiments demonstrate that treating plausibility as a soft-label distribution, combined with aggressive regularization, improves generalization on ambiguous samples. The submitted system achieves a Spearman correlation of 0.56 and an Accuracy (within SD) of 0.74 on the official test set.
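Layer-wise Learning Rate Decay, one of the two regularizers named in this abstract, assigns smaller learning rates to lower layers. A minimal sketch, assuming a geometric schedule in which the top layer keeps the base rate; the base rate, layer count, and decay factor shown are illustrative and not the authors' settings.

```python
def layerwise_learning_rates(base_lr, num_layers, decay):
    """Return one learning rate per layer, index 0 being the lowest layer.

    The top layer gets base_lr; each layer below it is multiplied by `decay` once more.
    """
    return [base_lr * decay ** (num_layers - 1 - layer) for layer in range(num_layers)]


# Illustrative values only: 12 encoder layers, decay factor 0.9.
rates = layerwise_learning_rates(base_lr=2e-5, num_layers=12, decay=0.9)
print(rates[0] < rates[-1])  # True: the lowest layer is updated most conservatively
```

In practice each rate would be attached to the optimizer parameter group of the corresponding layer, so that early layers, which hold more general linguistic knowledge, drift less during fine-tuning.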
Ellat at SemEval-2026 Task 11: Comparing Encoder and Decoder Models for Syllogistic Reasoning
Farzaneh Bayan Memar | Hanneke Huls | Matthijs Ten Hove
For SemEval-2026 Task 11 (Subtask 1: English), Team Ellat investigates whether language models can assess logical validity independently of semantic plausibility. Since these models learn statistical patterns instead of explicit logical rules, they often rely on world knowledge and semantic shortcuts rather than formal logic. To address this challenge, we evaluate three architectures: MiniLM-L6-mnli-binary, DeBERTa-v3-small, and Llama 3.1-8B-Instruct, applying task-specific fine-tuning for encoder models and Abstract Logic Augmentation with QLoRA for LLaMA. DeBERTa achieved the strongest overall performance, MiniLM showed clear reductions in content bias after fine-tuning, and Llama 3.1-8B exhibited strong plausibility bias in the zero-shot setting. However, our augmented fine-tuning approach led to only modest improvements and a partial shift toward structure-based reasoning. Overall, fine-tuning and abstraction-based augmentation reduce plausibility bias, but fully separating logical validity from semantic content remains challenging across architectures.
AI-Monitors at SemEval-2026 Task 4: A Hybrid Embedding and LLM Ensemble for Narrative Similarity
Vishnu Tripathi | Azad - | Prakhar Joshi | Pragyananda Sahoo | Gaurav Kumar | Piyush Arora | Neel Mani
Narrative similarity requires reasoning over the deeper structural properties of stories - shared themes, causal progression, and outcomes - rather than surface-level lexical overlap. We describe AI-Monitors, our system for SemEval-2026 Task 4 (Track A), which determines which of two candidate stories is more narratively similar to a given anchor. We explore a progression of approaches - from embedding-based similarity to structured LLM prompting and ensemble construction - guided by four hypotheses about where narrative reasoning gains can be found. The final system achieves 75% test accuracy on 400 instances, ranking 3rd out of 47 systems and approaching the individual human annotator ceiling of 78%. Our key findings are: i) structured few-shot prompting substantially outperforms dense embedding similarity; ii) selecting ensemble components by how differently they make errors - rather than by accuracy alone - produces stronger predictions; and iii) how you describe an example to the model affects its predictions.
Team ewelinaksiez at SemEval-2026 Task 11: Reducing Content Bias in Syllogistic Reasoning via Semantic Abstraction
Ewelina Księżniak
This paper presents our system for SemEval-2026 Task 11 Subtask 1 on content-independent syllogistic reasoning. The task evaluates whether language models can determine the formal validity of logical arguments independently of their semantic plausibility. To reduce content-driven biases, we propose a data augmentation strategy that progressively abstracts lexical semantics by replacing content words with symbolic placeholders and pseudo-words while preserving logical structure. Experiments based on fine-tuning microsoft/deberta-large-mnli show that abstraction-based augmentation reduces Content Effect and improves accuracy, leading to competitive performance on the official leaderboard. However, we observe substantial sensitivity to random initialization, suggesting that evaluation outcomes are partly influenced by stochastic factors. To better understand these effects, we conduct a layer-wise probing analysis using a Minimum Description Length framework, showing that the proposed approach decreases the accessibility of plausibility information in later transformer layers, indicating a shift toward more structure-oriented reasoning.
TeamLasse at SemEval-2026 Task 3: A Hybrid Generative-Discriminative Framework for Dimensional Aspect-Based Sentiment Analysis
Lasse Strothe | Shaghayegh Kolli | Jana Diesner
In this paper, we present our system for SemEval-2026 Task 3 Track A: Dimensional Aspect-Based Sentiment Analysis (DimABSA). The core objective is to extract structural sentiment elements—such as aspects, opinions, and categories—from text and predict their corresponding continuous Valence-Arousal (VA) scores. The primary challenge lies in simultaneously handling structural extraction and continuous numerical regression across highly imbalanced datasets encompassing multiple languages and domains. To address this complexity, we propose a decoupled, two-stage hybrid generative-discriminative framework. A generative Large Language Model first extracts structured sentiment tuples, while an encoder-based language model performs the continuous VA regression. To foster cross-lingual and cross-domain generalization, we train our models using a targeted data balancing mechanism.
CUNI at SemEval-2026 Task 4: Multi-Head Narrative Aspect Disentanglement via Entangled Synthetic Dataset
Jan Mitka | Jindrich Helcl
We participate in Track B of SemEval-2026 Task 4 on narrative similarity, focusing on narrative representation learning. We introduce a synthetic dataset designed to disentangle core narrative aspects (abstract theme, course of action, and outcome) and propose a multi-head multi-positive extension of the InfoNCE objective to train aspect-specific embeddings. Our best model achieves 64.25% accuracy on the test set. A nearest-centroid analysis indicates partial aspect-specific structure in the submitted checkpoint, while the training dynamics reveal a partial misalignment between the contrastive objective and the triplet-based evaluation protocol.
FMISUYotkovaKastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals
Elitsa Yotkova | Violeta Kastreva | Dimitar Dimitrov | Ivan Koychev | Preslav Nakov
SemEval-2026 Task 13 investigates machine-generated code detection across multiple programming languages and application scenarios, asking participating systems to generalize to unseen languages and domains. This paper describes our participation in Subtask A (binary classification) and explores both pretrained code encoders and lightweight feature-based methods. We design ratio-based features that are less sensitive to snippet length. To support the extraction of descriptiveness-related signals, we use parsing engines and a programming-language classifier. Additionally, we train a separate code-vs-text line classifier to identify raw natural language segments embedded within samples. We combine a shallow decision tree with heuristic rules derived from data analysis to produce the final predictions. Our approach is computationally efficient, requires only CPU resources for training, and achieves near-instant inference time, offering a lightweight alternative to large pretrained models.
Thiyaga6851 at SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models using Neuro-Symbolic Mapping
Thiyagarajaa Pk | Thenmozhi D.
This paper presents our system for SemEval-2026 Task 11 Subtask 1, which evaluates the formal validity of English syllogisms independently of semantic plausibility. To reduce content effects, we use a hybrid neuro-symbolic pipeline that separates natural-language abstraction from logical inference. The system maps each syllogism into categorical propositions using template rules and a learned parser, followed by explicit role mapping for the major, minor, and middle terms. If the abstraction is structurally complete, an exact Venn-style satisfiability solver checks validity; otherwise, the instance is routed to a learned fallback classifier. Our official submission achieved 71.73% accuracy, a Total Content Effect of 11.84, a Combined Score of 20.19, and a rank of 41st. Development analysis shows that symbolic inference is reliable on well-formed abstractions, while most remaining errors arise from paraphrase, multiword terms, and unstable term alignment.
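A Venn-style validity check of the kind this abstract mentions can be sketched by enumerating which of the eight regions of a three-circle diagram are non-empty. This is an illustrative brute-force version, not the team's solver: the term names S, M, P, the tuple encoding of propositions, and the absence of existential import are all assumptions made for this sketch.

```python
from itertools import product

TERMS = ("S", "M", "P")  # minor, middle, and major term
REGIONS = list(product([False, True], repeat=3))  # membership flags for (S, M, P)


def holds(statement, nonempty):
    """Evaluate a categorical proposition (form, X, Y) given the non-empty regions."""
    form, x, y = statement
    i, j = TERMS.index(x), TERMS.index(y)
    if form == "A":  # All X are Y
        return not any(r[i] and not r[j] for r in nonempty)
    if form == "E":  # No X are Y
        return not any(r[i] and r[j] for r in nonempty)
    if form == "I":  # Some X are Y
        return any(r[i] and r[j] for r in nonempty)
    if form == "O":  # Some X are not Y
        return any(r[i] and not r[j] for r in nonempty)
    raise ValueError(f"unknown form: {form}")


def is_valid(premises, conclusion):
    """Valid iff no choice of non-empty regions satisfies the premises but not the conclusion."""
    for mask in product([False, True], repeat=len(REGIONS)):
        nonempty = [r for r, keep in zip(REGIONS, mask) if keep]
        if all(holds(p, nonempty) for p in premises) and not holds(conclusion, nonempty):
            return False
    return True


# All M are P; all S are M; therefore all S are P.
print(is_valid([("A", "M", "P"), ("A", "S", "M")], ("A", "S", "P")))  # True
# All P are M; all S are M; therefore all S are P (undistributed middle).
print(is_valid([("A", "P", "M"), ("A", "S", "M")], ("A", "S", "P")))  # False
```

Because the check depends only on the abstract form, it is unaffected by how plausible the content words are, which is the property the task measures. Under this encoding, forms that rely on existential import count as invalid.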
LIAAD INESCTEC at SemEval-2026 Task 4: Unsupervised Narrative Similarity via Discourse Representation Structures and Sentence Embeddings
Evelin Amorim | Alípio Jorge | Purificação Silvano
In this paper, we describe an unsupervised approach using Discourse Representation Structures (DRS) for SemEval-2026 Task 4 on narrative similarity, which was formulated in two tracks. Our team developed a solution only for Track A, where the input is a triplet: an anchor story, a story A, and a story B. The goal in this formulation is to predict which story, A or B, is more similar to the anchor story. Our approach parses each story and transforms it into DRS format; we then leverage its structure to extract features, performing ablation experiments on the development dataset. Our strategy achieved 0.5975 accuracy on the official blind test set.
HUS@NLP-VNU at SemEval-2026 Task 3: Dual-Stream Syntax-Aware Modeling and Direct Preference Optimization for Dimensional ABSA
An Cao | Lam Hoang | Le Ngoc Toan | Ha Linh
Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA by predicting continuous sentiment intensity in the Valence-Arousal space. To tackle the regression subtasks (DimASR and DimStance), we propose a Dual-Stream Syntax-Aware architecture synergizing contextual semantics with a Deep Syntax-Guided Graph Convolutional Network (GCN). It utilizes a Context-Aware Anchor for semantic filtering and post-norm residuals to prevent oversmoothing. For generative extraction, we apply Direct Preference Optimization (DPO) via a resource-efficient, heuristic-based data perturbation strategy to construct preference pairs without costly LLMs. Across multilingual settings, our regression model achieves top-5 rankings in nine domains and obtains the best result on the Chinese-Finance dataset. Empirical analysis shows that explicit syntactic modeling consistently improves continuous sentiment regression, while DPO provides modest but stable gains for boundary-constrained extraction.
PolAR Bears at SemEval-2026 Task 9: Parameter-Efficient Fine-Tuning and Cross-Lingual Augmentation for Multilingual Polarization Detection
Vinay Ulli | Jyoti Kumari
This paper describes our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization. We focus on four low-resource Indian languages (Hindi, Bengali, Telugu, and Odia) across three subtasks: Polarization Detection, Type Classification, and Manifestation Identification. To address data scarcity, we employ cross-lingual data augmentation using IndicTrans2, expanding our dataset fourfold. Our unified architecture leverages Qwen3-4B-Instruct optimized via QLoRA, training a linear classification head on masked mean-pooled hidden states with only ∼33M trainable parameters. Our system achieved highly competitive results in Subtask 1, with an average Macro F1 of 0.813 across all languages (peaking at 0.8668 for Telugu). For the complex multi-label frameworks of Subtasks 2 and 3, our results expose a significant pre-training bias within foundational LLMs; while Hindi maintained strong F1 scores of 0.7008 and 0.7248, performance dropped considerably for the other three languages, highlighting the ongoing challenges of cross-lingual transfer for nuanced rhetorical techniques.
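Masked mean pooling, as named in the abstract above, averages token vectors while skipping padding positions. The sketch below uses plain Python lists for readability; a real system would operate on tensors, and the function name and input layout are assumptions.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored).

    hidden_states: list of per-token vectors (lists of floats), non-empty.
    attention_mask: list of 0/1 flags, one per token.
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        # No real tokens: return the zero vector rather than dividing by zero.
        return total
    return [t / count for t in total]


# The third token is padding and must not affect the pooled vector.
pooled = masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
```

A linear classification head would then take `pooled` as its input.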
Rasende Rakete at SemEval-2026 Task 6: LLM-First Approach with Iterative Prompt Repair for Classifying Evasion in Political Interviews
Omar Elbeltagui | Nils Knittel | Leonie Süß | Umut Yıldırır | Qiyan Zhai | Shaghayegh Kolli | Jana Diesner
We describe our system for SemEval-2026 Task 6 (CLARITY), which addresses automatic detection of evasive responses in political interviews. We adopt an LLM-first approach built around two core contributions: (i) an iterative prompt repair loop that diagnoses classification errors on concrete failure examples and applies prompt revisions and (ii) a configurable end-to-end Java Pipeline that supports multiple LLM providers, strategies, and systematic experimentation.
hdharpure at SemEval-2026 Task 3: BERT-Based Modeling and Prediction Behavior Analysis for Multilingual Valence–Arousal Scoring
Harshal Dharpure | Nicolay Rusnachenko
SemEval-2026 Task 3 is a dimensional aspect-based sentiment analysis (DimABSA) task that extends traditional ABSA by predicting continuous scores in two dimensions: valence (V) and arousal (A). Track A/Subtask 1 is a multilingual task in which, for a given text and the aspects mentioned in it, V/A scores must be predicted for each aspect. Our approach follows the pretraining-finetuning paradigm: we first pretrain a multilingual model ($M'$) and then fine-tune it ($M''_{l,d}$) on the training data of a specific domain ($d$) and language ($l$). We adopt XLM-RoBERTa ($M$) as the encoder, with separate regression heads for valence and arousal prediction. Our experiments on a manual split of the official SemEval-2026 Task 3 dataset ($D^{20\%}_{\text{train}}$) demonstrate that fine-tuning the model in two stages ($M''_{l,d}$) yields an average ≈ 1.36-fold improvement in $\mathrm{RMSE}_{VA}$ over direct fine-tuning ($M_{l,d}$). We also visualize and discuss the limitations of our system. Our code is publicly available.
Bitzkrieg at SemEval-2026 Task 13: Calibration-Aware Dual CodeBERT for Multilingual Machine-Generated Code Detection
Thenmozhi D. | Adithya S | Harshil Malisetty | Aadit P | Rohan R
We describe our submission to SemEval-2026 Task 13, addressing binary detection (Subtask A), generator attribution (Subtask B), and hybrid/adversarial authorship classification (Subtask C) of machine-generated code (MGC). For Subtask A, we fine-tune two CodeBERT models with complementary sampling strategies and apply percentile-based post-hoc calibration, improving Macro-F1 from 0.47 to 0.56 without additional training. For Subtask B, we combine TF-IDF n-grams, frozen CodeBERT embeddings, and language features with XGBoost, using synthetic augmentation and class weighting to handle an 11-class dataset skewed 88% toward the human class, achieving Macro-F1 of 0.289. For Subtask C, we fine-tune a CodeBERT classifier for four-way authorship classification, achieving Macro-F1 of 0.49. Our results highlight the importance of probability calibration for binary detection and class balancing for multi-class attribution.
harapalb at SemEval-2026 Task 4: Multi-Signal Neuro-Symbolic Ensembles for Narrative Similarity
Andrei Tiberiu Carp
This paper presents a neuro-symbolic ensemble for determining narrative similarity by moving beyond surface-level text matching toward structural and causal alignment. The architecture fuses three primary signals: action-focused neural embeddings that isolate event trajectories, a symbolic Structural Survival Ratio (SSR) that measures the preservation of discrete event tuples via dependency parsing, and high-level structural comparisons conducted by the gpt-5-mini model. Evaluated on the SemEval-2026 Task 4 test set, the integrated ensemble achieved an accuracy of 68.25%.
SemTechLab at SemEval-2026 Task 5: Context-Aware Homonym Disambiguation via Span-Specific Interaction Features
Karlo Babić | Ana Meštrović | Slobodan Beliga
This paper presents the SemTechLab system submitted to SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Sentences through Narrative Understanding. The task involves predicting the plausibility of a specific word sense given a short story context. Our approach (HINTS) utilizes a hybrid Transformer architecture based on nli-mpnet-base-v2. Unlike standard Cross-Encoders that rely solely on the [CLS] token, HINTS extracts span-specific embeddings for the target homonym from both the narrative context and the sense definition. We compute interaction features (concatenation, difference, and element-wise product) between these spans to explicitly model the semantic alignment between the story and the proposed sense. The model is trained using Kullback-Leibler Divergence to predict the full distribution of human ratings. For the official submission phase, scores were rounded to integers (1–5). However, subsequent analysis and ablation studies detailed in this paper utilize continuous (float) scores derived from the expected value for improved metric sensitivity. On the test set, our best configuration, which relies exclusively on local homonym features, achieved a Spearman correlation of 0.603 and an accuracy of 75.8%.
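The interaction features and the expected-value scoring described above can be sketched as follows. This is a minimal illustration rather than the HINTS implementation: the function names are hypothetical, the span embeddings are assumed to be equal-length vectors, and concatenating both spans with their difference and product is one plausible reading of "concatenation, difference, and element-wise product".

```python
def interaction_features(context_span, sense_span):
    """Concatenate both span vectors with their difference and element-wise product."""
    if len(context_span) != len(sense_span):
        raise ValueError("span embeddings must have the same dimension")
    difference = [c - s for c, s in zip(context_span, sense_span)]
    hadamard = [c * s for c, s in zip(context_span, sense_span)]
    return list(context_span) + list(sense_span) + difference + hadamard


def expected_rating(probabilities, ratings=(1, 2, 3, 4, 5)):
    """Continuous score: expected value of a predicted rating distribution."""
    return sum(p * r for p, r in zip(probabilities, ratings))


features = interaction_features([1.0, 2.0], [3.0, 4.0])
# A distribution concentrated on the two highest ratings.
score = expected_rating([0.0, 0.0, 0.0, 0.5, 0.5])
```

Rounding `score` to an integer would correspond to the official submission format, while the float keeps finer distinctions for rank-based metrics such as Spearman correlation.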
SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance
Hanna Abi Akl | Fabien Gandon | Catherine Faron | Pierre Monnin
This paper revisits our pipeline called Syllogistic Evaluation Framework-Common Logic Grammar Construction (SEF-CLGC). We combine formal logical notations with Small Language Models (SLMs) to evaluate reasoning performance on the SemEval-2026 Task 11 Subtask 1: Disentangling Content and Formal Reasoning in Large Language Models. Our experiments show that by relying solely on SLMs, trained on a combination of natural and symbolic languages, our best model achieves a content score of 27.80% on the task while significantly lowering the content bias in reasoning.
Aaron at SemEval-2026 Task 9: Multilingual Polarization Detection using Transformer-Based Models with Class Weighting and Threshold Tuning
Aaron Anampiu
This paper describes our submission to SemEval-2026 Task 9 on detecting multilingual, multicultural, and multievent online polarization. We address all three subtasks: binary polarization detection, polarization type classification, and manifestation identification for English and Swahili. Our approach leverages transformer-based models (RoBERTa-base for English, AfroXLMR-base for Swahili) with class-weighted loss functions to address severe label imbalance and per-label threshold tuning to optimize multi-label classification. On the test set, we achieve F1 macro scores of 0.7901 (English) and 0.7910 (Swahili) for Subtask 1, 0.4615 (English) and 0.4808 (Swahili) for Subtask 2 and 0.4791 (English) and 0.5830 (Swahili) for Subtask 3, which give competitive performance on the leaderboard, demonstrating the effectiveness of our methods for handling imbalanced multi-label polarization detection. Our error analysis reveals that models struggle with dehumanization detection and lack of empathy.
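Per-label threshold tuning, as mentioned above, can be sketched as a grid search that picks the decision threshold maximizing F1 independently for each label. The grid values, function names, and the use of held-out scores are assumptions; the abstract does not specify the search procedure.

```python
def f1_score(y_true, y_pred):
    """Binary F1; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def tune_threshold(y_true, scores, grid):
    """Return (threshold, F1) for the grid value with the highest F1."""
    best_threshold, best_f1 = grid[0], -1.0
    for threshold in grid:
        predictions = [1 if s >= threshold else 0 for s in scores]
        current = f1_score(y_true, predictions)
        if current > best_f1:  # ties keep the earlier (lower) threshold
            best_threshold, best_f1 = threshold, current
    return best_threshold, best_f1


def tune_per_label(label_matrix, score_matrix, grid):
    """Tune one threshold per label column of a multi-label problem."""
    label_columns = list(zip(*label_matrix))
    score_columns = list(zip(*score_matrix))
    return [tune_threshold(labels, scores, grid)[0]
            for labels, scores in zip(label_columns, score_columns)]


best_t, best_f1 = tune_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8],
                                 [0.1, 0.3, 0.5, 0.7])
```

Thresholds should be tuned on development data only, since tuning on the test set would inflate the reported scores.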
OseiBrefo-Liang at SemEval-2026 Task 12: Hybrid Causal Knowledge Graphs and Neural-Symbolic Policy Optimisation for Abductive Event Reasoning
Emmanuel Osei-Brefo | Huizhi(elly) Liang
Abductive Event Reasoning (AER) requires selecting plausible causal explanations for observed events from incomplete and noisy textual evidence. Unlike deductive reasoning, abductive inference proceeds from effects to candidate causes and is highly sensitive to distractor information and implicit multi-hop relationships. We present a hybrid neural-symbolic framework that models abductive reasoning as structured causal validation rather than unconstrained generation. Our framework integrates hybrid retrieval, micro-level evidence grounding, concept-level causal abstraction, reinforcement learning-based decision calibration, and structured Theorem-of-Thought verification. Experiments on SemEval-2026 Task 12 show that LLM reasoning constrained by structured causal graphs achieves the strongest development performance of 0.5288 and a leaderboard score of 0.61 on the test set, substantially outperforming symbolic-only and policy-only variants. These findings indicate that explicit causal modelling improves robustness in document-grounded abduction tasks.
Team Poznan at SemEval-2026 Task 13: Detecting Machine-Generated Code with Multiple Programming Languages, Generators, and Application Scenarios
Dawid Siera | Anatol Kaczmarek | Wiktor Kamzela | Adam Dobosz | Jakub Dutkiewicz
Detecting machine-generated code is crucial for maintaining software security and quality. Traditional approaches often rely on stylistic or statistical features, which are increasingly circumvented by advanced code generation models. This paper introduces a novel approach leveraging Graph Neural Networks (GNNs) to capture the structural characteristics of code, representing it as a program dependency graph. We demonstrate that our GNN-based classifier outperforms both traditional and embedding based methods on benchmark datasets, achieving improved accuracy and robustness in identifying code produced by various generation techniques. This work highlights the potential of GNNs for a more structural understanding of code authorship.
X-NLP at SemEval-2026 Task 12: Prompting LLMs for Abductive Event Reasoning
Caelen Mattie | Patrick Bowen | Milton King
In this work, we applied two different systems to SemEval-2026 Shared Task 12, which explores abductive event reasoning. Specifically, this task involves determining the cause of an event from a list of candidate causes. Instances are accompanied by documents that can provide applicable knowledge for the target event. Both of our systems involve prompting LLMs, and our best-performing system leverages retrieval-augmented generation. Our best-performing system achieved a score of 84% and ranked 40th out of the 221 total submissions.
COGNAC at SemEval-2026 Task 4: Evaluating Narrative Components with LLMs for Hard Story Similarity Cases
Tisa Islam Erana | Azwad Anjum Islam | Anshu Sharma | Mark Finlayson
This paper presents a two-stage system for the SemEval-2026 shared task on narrative similarity. The task defines similarity in terms of three components: abstract theme, course of action, and outcome. For Track A, the system first applies majority voting over multiple independent large language model (LLM) judgments to handle high-agreement (easy) cases. For low-agreement (difficult) cases, it routes examples to a second stage that decomposes stories into theme, course of action, and outcome, and either (i) scores these components individually with learned weights or (ii) uses structured chain-of-thought prompting to compare stories along the three dimensions. This two-stage approach improves robustness on difficult examples and achieves first place with 0.78 test accuracy. For Track B, the system generates embeddings of full stories and of individual narrative components using several embedding models. Experiments show that embeddings derived from the course-of-action component alone yield the best performance, achieving 0.72 accuracy and ranking first. Additional analyses reveal substantial annotation variability in the dataset and highlight the importance of handling ambiguity and disagreement when modeling narrative similarity.
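The two-stage routing for Track A can be sketched as below. This is an illustration under assumptions, not the COGNAC system: the agreement threshold of 0.8, the function names, and the fixed component weights are hypothetical (the abstract says the weights are learned).

```python
from collections import Counter


def route_by_agreement(judgments, min_agreement=0.8):
    """Stage 1: accept the majority label only when agreement is high enough."""
    label, votes = Counter(judgments).most_common(1)[0]
    if votes / len(judgments) >= min_agreement:
        return label, "easy"
    return None, "hard"  # low agreement: defer to the component-level stage


def weighted_component_choice(scores_a, scores_b, weights):
    """Stage 2, option (i): compare weighted sums of per-component similarities."""
    total_a = sum(weights[c] * scores_a[c] for c in weights)
    total_b = sum(weights[c] * scores_b[c] for c in weights)
    return "A" if total_a >= total_b else "B"


# Hypothetical weights; in the described system these would be learned.
WEIGHTS = {"theme": 0.2, "course_of_action": 0.5, "outcome": 0.3}

easy_case = route_by_agreement(["A", "A", "A", "A", "B"])
hard_case = route_by_agreement(["A", "B", "A", "B", "A"])
choice = weighted_component_choice(
    {"theme": 0.9, "course_of_action": 0.2, "outcome": 0.3},
    {"theme": 0.4, "course_of_action": 0.8, "outcome": 0.6},
    WEIGHTS,
)
```

Only the examples returned as `"hard"` incur the cost of the decomposition stage, which keeps the more expensive comparison for cases where the LLM judges disagree.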
L52+-IIMAS-UNAM at SemEval-2026 Task 1 (MWAHAHA): Joke Selection Through a Multi-Stage Prompt-Engineering and Heuristic Pipeline
Adolfo Tonatihu Camacho Gonzalez | Ximena Cruz | Natalia Godínez-Aldana | Lizeth Palacios-Patiño | Ramón Rangel | Ivan Vladimir Meza Ruiz
Humor generation remains one of the most challenging tasks in natural language processing, requiring creativity, incongruity resolution, cultural sensitivity, and strict structural control. We present a fully prompt-based system for headline-conditioned joke generation in SemEval-2026 Task 1 (MWAHAHA) for both English and Spanish. Deliberately avoiding fine-tuning, our approach relies on structured prompt engineering combined with a multi-stage heuristic pipeline. For Spanish, we extract a “stylistic-humor DNA” from a public joke corpus to guide generation. The pipeline integrates multi-candidate generation, diversity enhancement, iterative refinement, LLM-based rewriting, and constraint-aware selection. Human evaluation performed by the team (n=180) shows substantial gains over single-pass generation, particularly in funniness and punchline clarity. Official shared-task results were modest (12th/16 Spanish, 24th/31 English), underscoring that limited originality remains a key bottleneck. In an era dominated by large language models (LLMs) such as GPT-4o and Grok, our work demonstrates the value of linguistically grounded heuristics as an efficient, interpretable, and low-cost complement to black-box generation systems.
Rating Plausibility of Word Senses in Ambiguous Sentences Using Multi-Architecture Analysis
Naina Jain | Nidhi Arora | Pal Thakkar | Siba Sahu
Word sense disambiguation in narrative contexts requires systems to reason about subtle semantic relationships between candidate senses and discourse context. This paper addresses SemEval 2026 Task 5, which reformulates WSD as a graded plausibility prediction problem on a 1–5 Likert scale using the AmbiStory dataset. We present two complementary approaches: (1) a DeBERTa-v3-Large encoder with attention-weighted pooling and ordinal regression, achieving a Spearman correlation of 0.718, and (2) a rank-based ensemble combining FLAN-T5 and RoBERTa, achieving 0.692. Through ablation studies, we show that explicitly modeling ordinal structure improves performance over standard regression by 17.3%. We further analyze the strengths of each approach, showing that fine-tuned encoders capture fine-grained semantic relationships, while ensemble methods provide robustness through complementary modeling biases. Our results provide a detailed empirical analysis of design choices for graded plausibility prediction in narrative understanding.
MindFlayer at SemEval-2026 Task 8: DUALRAG: Answerability-Aware Generation for Multi-Turn RAG Conversations
Jerin Romijah Tuli | Md. Sartaj Alam Pritom | Talukder Naemul Hasan Naem
Our system, DualRAG (team MindFlayer), tackles SemEval-2026 Task 8 Subtask B - generating faithful responses in multi-turn RAG conversations. The core idea is simple: before generating anything, we first check whether reference passages exist for the current question. If they do, we route through a domain-guided generation prompt that instructs the model to answer using only those passages. If they do not, we route through a strict refusal prompt that tells the model to politely decline rather than guess. We used Meta’s Llama-4-Scout-17B through the Groq API, with no training or fine-tuning - purely zero-shot prompting. A lightweight post-processing layer catches the rare cases where the model ignores its instructions: if it refuses when passages are available, we replace the response with a neutral fallback; if it answers when no passages exist, we replace it with a standard refusal. Out of 507 test tasks, only 7 needed this correction. The system ranked 8th out of 26 teams with a harmonic mean of 0.7492, beating the strongest baseline (GPT-OSS-120B at 0.639) by a notable margin. The standout result is 100% refusal accuracy on all 130 unanswerable questions - something even GPT-4o and Llama 3.1 405B failed to achieve consistently according to prior work. Our RLF score of 0.8782 shows the responses stay tightly grounded in the reference passages. The relatively lower RBagg (0.6024) reflects the challenge of matching human-written phrasing in a zero-shot setting, which we identify as the clearest direction for improvement.
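The routing and post-processing logic described above can be sketched without any model call. Everything concrete here is an assumption for illustration: the prompt wording, the fallback and refusal strings, and the keyword-based refusal detector are hypothetical stand-ins for whatever the system actually uses.

```python
# Hypothetical canned responses and refusal cues (not taken from the paper).
STANDARD_REFUSAL = "I'm sorry, but I can't answer that from the available documents."
NEUTRAL_FALLBACK = "The reference passages are relevant, but no answer could be composed."
REFUSAL_CUES = ("i can't answer", "i cannot answer", "i'm sorry")


def choose_route(passages):
    """Answerability check: generate only when reference passages exist."""
    return "generate" if passages else "refuse"


def build_prompt(question, passages):
    """Build a grounded-generation prompt or a strict refusal prompt."""
    if choose_route(passages) == "generate":
        context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
        return f"Answer using only these passages:\n{context}\n\nQuestion: {question}"
    return f"No reference passages are available. Politely decline.\n\nQuestion: {question}"


def looks_like_refusal(response):
    """Crude keyword check standing in for a real refusal detector."""
    text = response.lower()
    return any(cue in text for cue in REFUSAL_CUES)


def postprocess(response, passages):
    """Correct the two cases where the model ignored its routing instructions."""
    refused = looks_like_refusal(response)
    if passages and refused:
        return NEUTRAL_FALLBACK   # refused although passages were available
    if not passages and not refused:
        return STANDARD_REFUSAL   # answered although nothing supports it
    return response
```

For example, `postprocess("Paris is the capital.", [])` returns the standard refusal, because an answer was produced with no passages to ground it.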
MindFlayer at SemEval-2026 Task 13: LACR-ENS: Calibration-Aware Ensemble Routing for Cross-Language AI-Generated Code Detection
Jerin Romijah Tuli | Talukder Naemul Hasan Naem | Md. Sartaj Alam Pritom
This paper presents LACR-ENS, a calibration-aware ensemble system for detecting AI-generated code across eight programming languages (SemEval-2026 Task 13). We identify a severe asymmetric out-of-distribution (OOD) failure in fine-tuned code transformers: Expected Calibration Error doubles from 0.09 (seen languages) to 0.18 (unseen languages), and high-confidence predictions (p0.80) are wrong 39% of the time on OOD inputs. We propose Language-Aware Confidence Routing (LACR), formally equivalent to implicit per-language temperature scaling, which reduces OOD ECE to 0.11 and improves macro-F1 by +0.013 over fixed-threshold ensembling. A language-family proximity analysis reveals that syntactic distance to training languages predicts OOD F1 with Pearson r=+0.94, providing a principled, label-free signal for deployment risk assessment and motivating a continuous routing extension. Our system combines UniXCoder and GraphCodeBERT via weighted logit-level fusion and achieves macro-F1 0.538, outperforming comparable encoder-only systems. We additionally document a HuggingFace label inversion pitfall that suppressed our initial score by approximately 0.29 F1.
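Expected Calibration Error and temperature scaling, the two calibration notions referenced above, follow standard definitions that can be sketched in plain Python. The equal-width binning with 10 bins and the function names are assumptions; the abstract does not state the binning scheme or how per-language temperatures are chosen.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((confidence, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_confidence)
    return ece


def softmax_with_temperature(logits, temperature=1.0):
    """Temperature scaling: T > 1 softens the distribution, T < 1 sharpens it."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Four predictions at 0.9 confidence, three of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
softened = softmax_with_temperature([2.0, 0.0], temperature=2.0)
```

Here accuracy is 0.75 against a mean confidence of 0.9, so the model is overconfident; applying a temperature above 1 to the logits lowers the top probability and narrows that gap.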
abateam at SemEval-2026 Task 1: Plan2joke – Humor Policies for Type-Specific Two-Pass Humor Generation
Andrii Dikhtiar | Antonii Viter | Bohdan Karaziia | Daryna Dementieva | Alexander Fraser
Our work was inspired by several recent directions in computational humor and evaluation, including:
- Baranov, Kniazhevsky, and Braslavski, "You Told Me That Joke Twice: A Systematic Investigation of Transferability and Robustness of Humor Detection Models" (2023).
- Tikhonov and Shtykovskiy, "Humor Mechanics: Advancing Humor Generation with Multistep Reasoning" (2024).
- Zhong, Huang, Gao, Wen, Lin, Zitnik, and Zhou, "Let’s Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation" (2023).
At a high level, we designed a policy-driven humor generation approach covering multiple humor types. We used optimal humor recognition systems and a context enrichment strategy, as well as SFT training based on a dataset composed from previous research samples and adjusted for alignment with our humor policies. This allowed us to perform an ablation study of the approach and to calibrate our system.
Lacuna Inc. at SemEval-2026 Task 4: Structurally Gated State-Space Models for Disentangling Narrative Similarity
Aleksey Kudelya | Rafif Alshawi | Alexander Shirnin
In this paper, we present the Invariant-Variant Disentangled State-Space Model (IVD-SSM), our submission to SemEval-2026 Task 4 on Narrative Story Similarity and Narrative Representation Learning. Evaluating narrative similarity is a profound computational challenge that requires models to look past concrete, superficial elements such as specific names, actors, objects, or settings to isolate and compare abstract patterns of causality and plot progression. To model these extended causal chains without the quadratic bottlenecks of standard Transformers, we leverage a hybrid State-Space Model (Jamba-1.5-Mini). Building upon this backbone, we introduce the Structurally Gated Alignment (SGA) head, a novel, differentiable algorithmic architecture. The SGA head operates on two scales: a heavily strided Macro-path maps the coarse structural skeleton of a story, which then acts as a gating mechanism to filter a full-resolution Micro-path, actively suppressing semantic noise and superficial keyword overlaps. Evaluated on both pairwise comparative judgments (Track A) and dense representation learning (Track B), our approach demonstrates that explicitly disentangling structural invariants from lexical variants provides a robust, principled framework for deep narrative understanding.
IReLIIT(BHU) at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Soumadip Majumder | Arjun Mukherjee | Krishna Tewari | Sanjaya Lenka | Sukomal Pal
This paper presents the IReLIIT(BHU) submission to SemEval-2026 Task 9 for the Chinese language track. We participated in all three subtasks: binary polarization detection, multi-label polarization type classification, and multi-label manifestation identification. Our approach is based on a unified transformer-based framework with cross-validation, prediction aggregation, and threshold optimization to improve robustness across tasks. In the official evaluation on the test data, our systems achieved Macro-F1 scores of 0.9081, 0.7962, and 0.6484 for Subtasks 1, 2, and 3, respectively.
WWTC@UniA at SemEval-2026 Task 13: BERT-based Code Authorship Detection and Qualitative Analysis
Linda Kupfer | Lisa Hader | Christian Jaumann | Annemarie Friedrich
This paper describes our system for SemEval-2026 Task 13 on detecting machine-generated code. We fine-tune small encoder-only models for detecting human-written versus machine-generated code and for identifying which large language model (LLM) family was used to obtain code. We find that a strong, general-purpose model (ModernBERT) outperforms models specifically pre-trained for the code domain. In the official evaluation, our system ranks 5th on subtask B and 6th on subtask C. Our detailed analysis reveals that comments and other natural language text that is part of the code snippets provide valuable information for identifying the LLM family that generated it. Moreover, we show that the embeddings of our finetuned ModernBERT do not distinguish well between LLM families, but they cluster human-written code by programming language.
Spinfo Cologne at SemEval-2026 Task 4: Explainable Creation of Narrativity Embeddings
Janis Pagel | Nils Reiter
We describe our submission to SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. The task requires (i) selecting, for a given anchor story summary, which of two candidate summaries is narratively closer (Track A) and (ii) producing a narrative representation of a story as a vector embedding (Track B). Our approach emphasizes interpretability by explicitly eliciting three narrativity aspects with a prompted large language model. We then construct a fixed-size narrative embedding by concatenating aspect-wise representations, comparing a static-embedding baseline (GloVe) to contextualized sentence-transformer embeddings (all-MiniLM-L6-v2). On the development set, the sentence-transformer variant outperforms the static baseline and achieves 61.5% accuracy on Track A, while the GloVe variant performs near chance. Our official submission reaches 60.25% accuracy on the Track A test set and 57.75% accuracy on Track B. Additional ablations show that the aspect pipeline slightly outperforms raw-text embeddings, but that aspect contributions are uneven. Qualitative analysis suggests that failures often stem from inconsistent aspect generation and from overemphasizing theme overlap over event-level similarity.
UFG-Semantic at SemEval-2026 Task 6: CLARITY - Unmasking Political Question Evasions
Aline Hamano | Beatriz Felicio | Henrique Galvão | Nádia Da Silva
We propose an approach for Task 6: CLARITY - Unmasking Political Question Evasions. We make use of data augmentation, supervised fine-tuning, and model benchmarking to detect and classify response ambiguity in political discourse. Building on well-founded theory on equivocation and leveraging recent advancements in language modeling, our system was structured based on question/answer (QA) pairs extracted from presidential interviews, and it was evaluated in Clarity-level Classification and Evasion-level Classification.
Seals-NLP at SemEval-2026 Task 9: A Comparative Study of Transformer Architectures for Polarization Detection
Minh Smith | Cheryl Seals
We describe the Seals-NLP system for SemEval-2026 Task 9 (POLAR) Subtask 1, binary polarization detection. Our study compares (i) fully fine-tuned encoder-only transformers, (ii) QLoRA-based fine-tuned open-weight LLMs, and (iii) zero-shot prompted LLMs. ModernBERT-large emerges as the most cost-effective option, matching or surpassing larger fine-tuned and zero-shot LLMs in macro-F1 while requiring substantially less memory and lower latency. An error analysis by failure mode and polarization subtype reveals systematic over-triggering on political cue words and under-detection of sarcastic vilification and multifaceted attacks in the POLAR dataset across all models.
Team JAT at SemEval-2026 Task 9: Enhancing Polarization Detection with Cross-Lingual Transfer and Feature Fusion
Aleksandra Matkowska | Taya Lin | Yu-Chun Chao
We describe our system for SemEval-2026 Task 9 (POLAR), Subtask 1 - binary polarization detection. Our approach investigates polarization detection through monolingual and cross-lingual experimental settings. We first utilize a RoBERTa-based architecture enhanced with feature fusion, combining contextual sentence representations with handcrafted sentiment and intensity cues. As for multilingual joint training, we explore it within the Indo-European family to test whether cross-lingual transfer can elevate performance in data-scarce scenarios. Our final fine-tuned model achieves an average F1-score of 0.763 on the test set, compared to 0.491 for a random baseline. We also report ablations for augmentation, feature fusion, and class weighting to quantify each component’s contribution.
yasaminal at Semeval2026: Constraint-Aware Humor Generation with Knowledge Graph Guidance
Yasamin Aali
This paper presents a knowledge-guided humor generation system, which involves generating humorous text from either a pair of words or a news headline. The proposed approach integrates structured semantic reasoning derived from a knowledge graph (KG) with controlled generation using large language models (LLMs). The system constructs an intermediate KG hint consisting of related concepts retrieved in the target language, which is appended to the prompt to guide the generation process in a structured manner. A single candidate joke is generated per input using controlled top-p decoding. Experimental results show that incorporating KG reasoning improves relevance and constraint satisfaction, while LLM-based generation ensures fluency and creativity. Overall, the proposed method offers a structured and interpretable framework for humor generation across multiple languages.
MALTO at SemEval-2026 Task 13: Detecting Human, AI, and Hybrid Code via Hard Negative Mining and Curriculum-Driven Ensembles
Hüseyin Arslan | Evren Ayberk Munis | Timofei Khudonogov | Mert Akgun | Murat Besli | Ayhan Meherrem | Claudio Savelli | Flavio Giobergia
The rapid advancement of Large Language Models (LLMs) has significantly impacted software engineering, posing challenges for determining the origin and authenticity of source code. This paper presents the MALTO team’s submission for SemEval-2026 Task 13, explicitly focusing on Subtask B (Authorship Attribution among 11 classes) and Subtask C (Hybrid Code Detection). To address severe class imbalance and the complex boundaries of mixed human-machine code, we propose a unified framework that leverages an ensemble of UniXcoder and CodeT5. Our approach integrates a robust Tree-sitter-based Universal Canonicalization strategy, Data Augmentation, and a novel 3-Phase Curriculum Training schedule enhanced by Hard Negative Mining. Specifically, UniXcoder’s cross-modal representations excel at distinguishing among semantically overlapping LLM families (Subtask B), whereas CodeT5’s identifier-aware architecture is superior at detecting subtle structural anomalies in hybrid and adversarial snippets (Subtask C). By aggregating these complementary strengths, our soft-voting ensemble overcomes the limitations of individual models, demonstrating strong robustness against imbalanced distributions and effectively discriminating between purely human, purely machine, hybrid, and adversarial code snippets.
blue at SemEval-2026 Task 4: Synergizing Long-Context Reranking with Semantic Similarity for Narrative Alignment
Krish Sharma | Lakksh Sharma | Rhea Singhal | Jatin Bedi
This paper describes the system submitted by team blue for SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning, with a primary focus on the Pairwise Similarity subtask (Track A). The core challenge of this task lies in identifying deep structural alignments between stories, which is fundamentally hindered by the restricted context windows of standard transformer architectures that truncate narratives before reaching critical plot resolutions. To overcome this context bottleneck, we propose a hybrid ensemble architecture designed to capture extended narrative arcs. Our approach synergizes a cross-encoder (Jina Reranker v2), which processes long inputs via a sliding-window strategy over 1,024-token chunks to evaluate the global "course of action," with a semantic bi-encoder (RoBERTa-Large) to validate local tonal consistency. This dual-stream system achieved a Pearson correlation score of 0.63, demonstrating that processing narrative content beyond the 512-token truncation boundary is strictly necessary for accurate pairwise narrative comparison.
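A sliding-window split over 1,024-token chunks, as mentioned in the abstract above, might look like the sketch below. The abstract gives only the chunk size; the 512-token stride and the function name are assumptions for illustration, and how the per-chunk scores are later combined is not specified in the source.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[int],
                    window: int = 1024,
                    stride: int = 512) -> List[Sequence[int]]:
    """Split a token sequence into overlapping windows that cover every token."""
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window reached the end of the sequence
        start += stride
    return chunks


# A hypothetical 2,500-token story represented by token ids 0..2499.
story_tokens = list(range(2500))
story_chunks = sliding_windows(story_tokens)
```

With an overlap between neighbouring windows, plot events that straddle a chunk boundary still appear whole in at least one chunk.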
blue at SemEval-2026 Task 5: NarrBERT: Narrative-Aware BERT for Word Sense Disambiguation
Rhea Singhal | Krish Sharma | Lakksh Sharma | Jatin Bedi
This paper outlines the method submitted by team blue for SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Sentences through Narrative (AmbiStory). The task requires predicting plausibility scores that match human judgments instead of just picking a single correct sense as the output. This means that fine-grained contextual modeling is vital. To tackle this problem, we propose a BERT-based cross-encoder regression model. This model encodes the entire narrative context, which includes the precontext, the ambiguous sentence, and the ending, along with candidate sense definitions and example usages. Unlike bi-encoder sentence-level methods, our model allows for token-level interaction between story cues and sense meanings. This interaction helps capture subtle narrative disambiguation signals. We conduct a systematic exploration of model architectures and training strategies, progressing from a sentence-transformer baseline to an optimised BERT cross-encoder. On the development set, our best configuration achieves a Spearman rank correlation of 0.66. On the official test set, the system achieves a Spearman correlation of 0.4866 and an Accuracy-within-Standard-Deviation of 0.6613, substantially outperforming sentence-transformer bi-encoder baselines.
LATE-iimas at SemEval-2026 Task 10: Conspiracy Detection via DeBERTa-v3 Ensemble and Weighted Loss Optimization
Jose Vazquez-Cerrillo | Helena Gomez-Adorno | Gemma Bel-Enguix
This paper describes the system developed by the LATE-iimas team for Task 10 of SemEval-2026: PsyCoMark, specifically for Subtask 2, which involves conspiracy detection. Our approach was based on fine-tuning the popular pre-trained language model DeBERTa-v3-Large. To address the challenges inherent in the provided dataset, such as class imbalance and the linguistic ambiguity of the "Can’t tell" label, we implemented a 5-Fold Stratified Cross-Validation technique combined with a Weighted Cross-Entropy Loss function. The final system, which operates using an ensemble of the resulting models, achieved a Weighted F1-Score of 0.75, placing it in the top 10 of the ranking.
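To make the weighted cross-entropy idea from the abstract above concrete, here is a small standard-library sketch. The abstract does not say how the class weights were chosen, so the inverse-frequency weighting, the function names, and the toy label counts are all assumptions for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def inverse_frequency_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class by n / (k * count), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs: List[Dict[str, float]],
                           labels: Sequence[str],
                           weights: Dict[str, float]) -> float:
    """Weighted mean of -log p(true class), normalised by the sum of weights."""
    num = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den


# Hypothetical imbalanced label distribution with a rare "Can't tell" class.
train_labels = ["No"] * 6 + ["Yes"] * 3 + ["Can't tell"] * 1
class_weights = inverse_frequency_weights(train_labels)

# A model that predicts a uniform distribution for every example.
uniform = {"No": 1 / 3, "Yes": 1 / 3, "Can't tell": 1 / 3}
loss = weighted_cross_entropy([uniform] * len(train_labels),
                              train_labels, class_weights)
```

Under this weighting, a mistake on the rare class costs more than a mistake on the majority class, which counteracts the pull toward always predicting the majority label.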
GIL-Zaragoza at SemEval 2026 Task 11: Comparing Classification, Autoformalization, and Ontologies for Formal Reasoning Capabilities
Francisco Lopez-Ponce | Lucia Pitarch | Iván Saavedra Martínez | Ignacio Huitzil | Sergio Ojeda Trueba | Fernando Bobillo | Gemma Bel-Enguix
This paper describes our participation in Task 11 of SemEval-2026, which evaluates the ability of models to determine logical validity of syllogisms independent of real-world content. We develop and compare three approaches for Subtask 1: (1) an encoder-based classification baseline using both classical ML methods and fine-tuned BERT with debiasing strategies; (2) an autoformalization pipeline combining DPO-aligned models with first-order logic translation and formal inference via Prover9; and (3) a hybrid neuro-symbolic approach using GPT to generate OWL 2 ontologies evaluated with the HermiT reasoner. Our best result was achieved by the encoder-based classifier, obtaining 72.25% accuracy and a combined score of 20.37, placing 40th out of 45 participating teams. Analysis shows that classification methods exhibit lower content bias, autoformalization approaches suffer from translation inconsistencies and syntax incompatibilities, and ontology-based reasoning is hindered by prompt design limitations and verbose serialization formats. All our code can be found in the paper’s repository.
Polito Team at SemEval-2026 Task 8: Scaling Multi-Turn RAG: High-Performance Parallelized Pipeline for the MTRAG Benchmark
Murat Çelik | Nejla Dinçer | Can Ersoy | Mert Toprak | Barış Ünal | Riccardo Coppola | Flavio Giobergia
Recently, Retrieval-Augmented Generation (RAG) has become a significant task in Large Language Models (LLMs). In multi-turn RAG, a good system must overcome the challenges of maintaining context as the dialogue turns progress and manage the issue of generating answers based on conversation history. In this work, we address the MTRAGEval task 8 at SemEval-2026, by presenting a high-performance, parallelised Multi-Turn RAG pipeline designed to address three subtasks: Retrieval (Subtask A), Generation (Subtask B), and End-to-End RAG (Subtask C). Our methodology utilises a Streamlit framework that allows users to embed diverse corpora with varying vector spaces and embedding models, facilitating configuration for each task based on its nature. Some key experiments focus on the performance of different vector databases and embedding models, the necessity of LLM-based query rewriting (QR) for non-standalone questions, the use of different rerankers, and the scale and performance of the selected LLM for answer generation. We conclude that a configuration utilising query rewriting along with reranking delivers the best results. The code is available on GitHub https://github.com/merttoprak1/MTRAGEval-Evaluating-Multi-Turn-RAG-Conversations.
CICL26 at SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning
Wanzhao Zhang | Yue Yu
This paper describes our submission to SemEval-2026 Task 4 (Track A) on narrative similarity. The task requires systems to determine which of two candidate stories is more narratively similar to a given anchor story. While large language models (LLMs) demonstrate strong semantic reasoning abilities, their predictions in comparative settings can be sensitive to stochastic decoding and input order. We propose a lightweight inference-time cascade strategy that improves robustness without modifying the underlying model. Our approach combines self-consistency voting to reduce sampling variance, a swap-based symmetry test to mitigate positional bias, and a margin-based deterministic decision rule to resolve disagreements. This design explicitly leverages model uncertainty while maintaining reproducibility and simplicity.
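One way to read the three-stage cascade described in the abstract above is sketched below. The abstract names the components but not their exact form, so the `judge` callable, the sample count `k`, the margin threshold, and the fallback choice are hypothetical details introduced only for this illustration.

```python
from collections import Counter
from typing import Callable, Tuple

# Hypothetical judge: given (anchor, shown_first, shown_second), it returns
# "first" or "second" for whichever candidate it finds closer to the anchor.
Judge = Callable[[str, str, str], str]


def vote(judge: Judge, anchor: str, x: str, y: str, k: int) -> Tuple[int, int]:
    """Self-consistency: sample the judge k times and count each answer."""
    counts = Counter(judge(anchor, x, y) for _ in range(k))
    return counts["first"], counts["second"]


def cascade(judge: Judge, anchor: str, a: str, b: str,
            k: int = 5, margin: int = 2, fallback: str = "a") -> str:
    """Return "a" or "b"; k should be odd so each vote has a strict majority."""
    a1, b1 = vote(judge, anchor, a, b, k)  # original order: a shown first
    b2, a2 = vote(judge, anchor, b, a, k)  # swapped order: b shown first
    pick_original = "a" if a1 > b1 else "b"
    pick_swapped = "a" if a2 > b2 else "b"
    if pick_original == pick_swapped:
        return pick_original               # symmetric answer, accept it
    diff = (a1 + a2) - (b1 + b2)           # pooled vote margin for a
    if abs(diff) >= margin:
        return "a" if diff > 0 else "b"
    return fallback                        # deterministic tie-break


def content_judge(anchor: str, x: str, y: str) -> str:
    """Toy judge that looks only at content, never at position."""
    return "first" if x == "match" else "second"


def position_biased_judge(anchor: str, x: str, y: str) -> str:
    """Toy judge that always prefers whatever is shown first."""
    return "first"
```

A judge that answers from content gives the same pick in both orders, while a purely position-biased judge flips under the swap and is caught by the margin rule.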
UCSC-NLP at SemEval-2026 Task 13: Multi-View Generalization and Diagnostic Analysis of Machine-Generated Code Detection
Kargi Chauhan | Sadiba Nusrat
This paper presents the system for SemEval-2026 Task 13, addressing both binary detection (Subtask A) and multi-class attribution (Subtask B). For Subtask A, we propose a robust multi-view training framework using UniXcoder-base, incorporating domain-specific structural prefixes, delexicalization with symmetric KL consistency loss, and token dropout. Our system achieves a high macro F1 of 0.845 on the out-of-distribution test set, demonstrating strong generalization across five unseen languages and two unseen domains. For Subtask B, we provide a rigorous diagnostic analysis of majority-class bias in transformer-based detectors. We reveal a significant performance gap where an 88.4% accuracy masks a near-complete failure in minority-class attribution (0.086 Macro F1), highlighting that standard fine-tuning is insufficient for fine-grained generator identification. Our results expose distinct regimes in code detection and motivate the need for imbalance-aware, structure-focused modeling in future work.
MIUN BiasPatrol at SemEval-2026 Task 13: Why TF-IDF can Beat Transformers for OOD Code Detection
Loviza Sahlen | Thomas Springfeldt | Mehwish Fatima | Raja Khurram Shahzad
The increasing use of AI-generated code underscores the need for effective detection systems. However, their performance often deteriorates when faced with distribution shifts. This paper presents our system for SemEval-2026 Task 13, Subtask A, which focuses on binary classification of human-written versus machine-generated code across various programming languages and domains. We systematically compare traditional classifiers, such as Random Forest and XGBoost, which utilize statistical and TF-IDF features, against pipelines that leverage frozen embeddings from advanced code transformers like UniXcoder and GraphCodeBERT. Our results reveal a notable trade-off: while transformer-based pipelines excel in in-distribution validation (reaching up to 0.89 Macro F1), they experience severe performance drops of up to 77% when applied to out-of-distribution languages and domains. In contrast, models employing TF-IDF with tree-based classifiers demonstrate significantly greater stability. We identify this fragility as a result of a bias toward superficial formatting, particularly whitespace, which is accentuated by transformers. By implementing simple space normalization, we enhance the out-of-distribution robustness of traditional models; however, this also highlights the ongoing dependence of embeddings on these non-semantic features. Our findings suggest that for creating generalizable code detection systems, straightforward, well-normalized lexical features may be more reliable than complex, unrefined embeddings.
MINDS at SemEval-2026 Task 9: A Multi-Paradigm Approach to Cross-Lingual Polarization Detection
Angelo Iannielli | Samuele Maroli | Marco Roberto | Stefano Sammartino | Valentino Vacirca | Claudio Savelli | Riccardo Coppola | Flavio Giobergia
Online polarization has become a central challenge in digital discourse, characterized by hostility, identity-based division, and culturally dependent expressions that vary across languages. Automatically detecting such phenomena is particularly difficult in multilingual settings, where semantic nuance and implicit rhetoric complicate cross-lingual generalization. In this context, we participate in POLAR, a shared task at SemEval 2026 on multilingual polarization detection and categorization across 22 languages. We compare three modeling paradigms: multilingual encoder fine-tuning, translation-based transfer learning, and prompting-based generative reasoning. For the multi-label categorization task, we introduce a two-stage cascaded architecture to mitigate false positives under severe class imbalance. Our results show that multilingual encoders achieve the most robust performance for binary detection, whereas reasoning-based prompting is competitive for fine-grained category classification. This comparative study highlights the strengths and limitations of each paradigm for cross-lingual polarization analysis.
GuysLLM at SemEval-2026 Task 5: NLI-Informed Regression for Graded Word-Sense Plausibility in Narrative Contexts
Niccoló Antonelli-Dziri | Sixtine Marcotte | Emanuele Rosapepe | Gabriele Santona | Omar Wafaay | Lorenzo Vaiani | Riccardo Coppola | Flavio Giobergia
While large language models (LLMs) excel at semantic reasoning, their discrete token-based outputs introduce limitations for fine-grained regression tasks requiring continuous scoring. We address graded word-sense plausibility estimation by reformulating it as a Natural Language Inference (NLI) regression problem, adapting DeBERTa-v3-large with NLI pretraining and a regression head to predict continuous plausibility scores from story-sense pairs. We compare this model against BERT, vanilla DeBERTa, SmolLM variants, and state-of-the-art LLMs under various prompting strategies, and show that the NLI-finetuned model achieves superior rank correlation and alignment with human judgments. While several baselines collapse toward mean predictions and LLMs show unstable prompting sensitivity, our findings establish NLI-informed pretraining as highly effective for narrative plausibility regression, highlighting fundamental LLM limitations for word sense disambiguation.
AbstractReasoner at SemEval-2026 Task 11: Reducing Content Effects via Knowledge Distillation and Structured Reasoning Prompts
Akash Chowdhury | Vlad Pavlovich | Julius Dunfoy | Sophia Yang | Abhiram Borra
Syllogistic reasoning serves as a critical diagnostic for evaluating whether Large Language Models (LLMs) perform genuine logical inference or rely on semantic shortcuts. SemEval-2026 task 11 explores "content effects"—where model judgments are biased by world knowledge rather than logical form. Recent work has illustrated that LLM optimization techniques have provided substantial performance gains in mitigating content effect. To contribute to this research domain, this paper performs a systematic study of different intervention strategies: zero-shot chain of thought, symbolic representation, activation-steering, and supervised fine-tuning along with prompting optimization during inference. We achieved the best performance with our largest model (Phi-4 14B) by fine-tuning with chain of thought distillation, symbolic abstractions and LLM as optimizer prompting (FTOptim) evaluated on the held-out split derived from the training data. This approach achieved the highest Combined Smooth Score (CSS) of 31.16. Additionally, Llama 3.1 provided noteworthy performance with 31.01 CSS under the same FTOptim approach, indicating the performance gain was LLM-agnostic.
AI4PC-Howard University at SemEval-2026 Task 9: Evaluating Teacher-Student Weak Supervision and Direct LLM Prompting for Multilingual Political Polarization Detection
Surangana Aryal | Saurav Aryal
We describe the AI4PC–Howard University submission to SemEval-2026 Task 9, Subtask 1 on multilingual political polarization detection across 22 languages. We investigated two approaches: (1) a weakly supervised teacher–student framework in which a large language model (LLM) generated pseudo-labels to train an XLM-RoBERTa-base classifier, and (2) a context-engineered prompt-based approach using Meta-Llama-3.1-8B-Instruct. The teacher–student approach exhibited instability under distribution shift and collapsed toward majority predictions at test time. Consequently, our final submission used direct inference with Meta-Llama-3.1-8B-Instruct. While this approach produced competitive macro-F1 across evaluated languages, results reveal strong positive-class bias and substantial precision–recall imbalance. Our findings highlight limitations of weak supervision for subjective political tasks and underscore trade-offs between scalability, bias, and computational cost in LLM-only multilingual systems.
SpyComet at SemEval-2026 Task 11: When Adversarial Debiasing Backfires - A Comparison of Data-Level and Model-Level Debiasing
Sai Aravind C | Sunil Saumya | C Pothan Reddy
We describe MLA-CI (Multi-Layer Adversarial for Content Invariance), a DeBERTa-v3-base system for SemEval-2026 Task 11 Subtask 1 on content-invariant syllogistic reasoning. MLA-CI combines multi-layer feature extraction, gradient-reversal adversarial training, structure-preserving template augmentation, implausible-class oversampling, and focal loss. Our principal contribution is a systematic ablation study, confirmed across three random seeds, showing that adversarial training at standard strength is counterproductive: removing gradient reversal improves the mean validation score from 26.41 ± 0.99 to 38.15 ± 5.32. Per-condition analysis reveals that gradient reversal over-suppresses plausibility-correlated features, creating an inverted content effect that disproportionately harms plausible-valid accuracy. A sweep over seven adversarial pressure values reveals that only very light adversarial pressure (≤ 0.1) preserves accuracy, while values of 1.0 or above, including the submitted setting, cause severe degradation. We conclude that data-level debiasing through structure-preserving augmentation is more effective and robust than model-level adversarial debiasing for this task.
TeamSLS at SemEval-2026 Task 13: Detecting Machine-Generated Code with CodeBERT and Structural Features
Sai Laasya Gorantla | Shreemithra Naveen | Steven Bethard
We describe our system for SemEval-2026 Task 13 Subtask A, which focuses on detecting whether source code is written by a human or generated by an AI system. We propose a hybrid approach that combines semantic embeddings from CodeBERT with lightweight, language-agnostic structural features extracted using Tree-sitter. We compute normalized structural ratios such as nesting depth, logic density, complexity per line, average line length, and punctuation frequency. These structural signals are concatenated with CodeBERT embeddings and passed to a linear classifier for binary prediction. Experimental results on the official validation split show that combining semantic and normalized structural representations substantially improves the model’s detection performance on seen-language distributions. However, results on unseen test data reveal significant performance degradation under cross-language distribution shifts. On the official leaderboard, our system ranked 47th out of 81 participating teams.
Howard University-AI4PC at SemEval-2026 Task 1: Exploring Prompt Strategies for Automatic Humor Generation
Lawal Abdulmujeeb | Saurav Aryal
We present our system for SemEval-2026 Task 1, Subtask A, a humor generation task requiring systems to generate jokes given either a news headline or a word pair. Our approach used the Llama-3.1-8B-Instruct model, which we selected after comparing several candidate models and humor strategies across our experiments. For the headline inputs, we used a two-shot prompt to frame the output as a tweet; specifying the tone proved to be a particularly important factor in output quality. For the word-pair inputs, we instructed the model to commit to an everyday situation and generate a funny thought based on it. While experimenting, we noticed that models would start a joke one way with the first word and abruptly shift context mid-joke just to include the second word; committing to a single situation helped address this. We also made use of personas here, specifically Dave Chappelle. Our final system shared 2nd place with 3 other systems out of 32 total systems and achieved an Elo score of 1020. Achieving these results with no fine-tuning suggests that careful prompt design alone can yield competitive results.
Howard University-AI4PC at SemEval-2026 Task 8: Query Reformulation and Dense-Lexical Retrieval Fusion for Multi-Turn Retrieval-Augmented Generation
Sijan Shrestha | Saurav Aryal
We present a training-free hybrid retrieve-then-rerank system for multi-turn retrieval-augmented generation, submitted to all three subtasks of SemEval-2026 Task 8 (MTRAGEval): passage retrieval (Task A), generation with reference passages (Task B), and end-to-end RAG (Task C). Our system addresses the core multi-turn challenges (non-standalone questions, unanswerable queries, and shifting passage relevance) across four domain-specific corpora: ClapNQ, Cloud, FiQA, and Govt. Queries are reformulated through LLM-driven rewriting, decomposition into sub-queries, and Hypothetical Document Embeddings (HyDE). Retrieved candidates from dense vector search (BGE-base-en-v1.5) and BM25 lexical matching are fused via Reciprocal Rank Fusion and reranked by a cross-encoder (BGE-reranker-large). Llama-3.3-70B-Instruct generates extractive, context-grounded responses with built-in abstention for unanswerable queries. Using only open-source models without fine-tuning, the system achieves nDCG@5 of 0.4098 on Task A (22nd/38), a harmonic mean of 0.7462 on Task B (9th/26), and 0.5796 on Task C (2nd/29), coming within 1.1% of the top submission. We attribute the strong Task C result to the synergy between multi-signal query reformulation and faithful extractive generation.
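The Reciprocal Rank Fusion step mentioned in the abstract above can be illustrated with a short sketch. The scoring formula, 1 / (k + rank) summed over rankings, is the standard RRF definition; the constant k = 60 is a commonly used default and an assumption here, since the abstract does not report the value, and the document ids are made up.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break exact ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 lists from a dense retriever and from BM25.
dense_ranking = ["d1", "d2", "d3"]
bm25_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```

Because RRF uses only ranks, it needs no calibration between dense similarity scores and BM25 scores, which are on different scales.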
UNF-BMI at SemEval-2026 Task 3: Research Domain Criteria-Guided Large Language Models for Dimensional Aspect-Based Sentiment Analysis
Athlene Jones | Vishwaa Shah | Indika Kahanda
We present the UNF-BMI system for SemEval-2026 Task 3, Track A, Subtask 1 (Dimensional Aspect Sentiment Regression, DimASR), which focuses on predicting continuous Valence–Arousal (VA) scores for aspects in text. Our approach integrates psychologically grounded affective signals inspired by the Research Domain Criteria (RDoC) framework. We investigate two complementary methods: first, an in-context learning framework using Mistral-7B-Instruct with semantically retrieved few-shot examples augmented by lexicon-derived RDoC valence and arousal cues; second, a supervised multi-task learning model based on RoBERTa, where VA regression is the primary objective and RDoC-based positive/negative signal prediction serves as an auxiliary task to regularize shared representations. Experiments on English laptop and restaurant review datasets demonstrate that incorporating RDoC-inspired affective priors reduces RMSE compared to baselines, particularly in low-signal text where explicit sentiment cues are sparse.
DFKI-MLT at SemEval-2026 Task 7: Steering Multilingual Models Towards Cultural Knowledge
Yusser Al Ghussin | Daniil Gurgurov | Yasser Hamidullah | Josef Van Genabith | Cristina España-Bonet | Simon Ostermann
Large language models (LLMs) are increasingly used across diverse linguistic and cultural contexts, yet their cultural knowledge remains uneven across regions and languages. We present the DFKI-MLT system for SemEval-2026 Task 7 on cultural awareness, where we apply activation steering to multilingual LLMs using language vectors extracted from parallel FLORES data. Our method performs inference-time adaptation by adding language-specific steering vectors to the residual stream at a selected transformer layer, without any parameter updates. We participated in both the short-answer (SAQ) and multiple-choice (MCQ) tracks; however, only our MCQ submission received an official score. In the official MCQ track, we achieved 86.96% accuracy, ranking 7th out of 17 teams. To better understand system behavior, we conduct post-hoc analyses on the shared-task MCQ and SAQ settings. These analyses show that activation steering yields modest and heterogeneous improvements on cultural reasoning: gains are strongly layer-sensitive, vary substantially across language–region pairs (some configurations even degrade performance), and interact with prompt formulation (generic vs. culturally conditioned prompts). Our findings suggest that prompt design and activation steering should be jointly optimized for culturally aware multilingual inference. We release our code and experimental configurations at https://github.com/Yusser96/SemEval-2026-Track7.
sutta at SemEval-2026 Task 12: A Multi-Perspective Retrieve-Verify-Aggregate Framework for Abductive Event Reasoning
Junliu Zou | Liang Yang | Jingjie Zeng
We present our system for SemEval-2026 Task 12: Abductive Event Reasoning (AER). The task asks models to identify the direct causes of real-world events from multiple-choice options using retrieved documents. Rather than fine-tuning on the training data, we built a zero-shot "Retrieve-Verify-Aggregate" pipeline around Qwen3-8B. We first isolate relevant evidence using BM25 and cross-encoder reranking. To evaluate causal links, we prompt the model with several distinct "personas" and aggregate their independent decisions through majority voting. Our system scored 0.7614 on the official test set. This performance suggests that strict retrieval combined with diverse reasoning prompts can help compact open-source models ignore irrelevant context and perform complex causal inference, entirely without task-specific training.
Mendel292 at SemEval-2026 Task 4: Disentangled Narrative Embeddings for Story Similarity
Mauricio Gruppi | Sankalpa Rijal | Justin Debenedetto
This paper describes Mendel292, our system for SemEval-2026 Task 4 on Narrative Story Similarity. We introduce a narrative encoder that decomposes story representations into explicit subspaces for abstract theme, course of action, and outcome, built on a pre-trained sentence embedding model and a trainable BiLSTM projection layer with a triplet margin loss objective. We augment the training set via backtranslation, and incorporate weakly supervised multi-task objectives derived from unsupervised narrative clustering. The proposed architecture was designed to learn a latent representation of narratives in a few-shot setting due to the limited amount of training data. Despite using a rich pre-trained transformer, the model was outperformed by an unsupervised pooling approach on the classification task. While our systems do not match the top leaderboard scores, they allow us to systematically study the effects of subspace factorization, weak labels, and data augmentation on narrative similarity modeling.
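The triplet margin loss objective named in the abstract above has a standard form, max(0, d(anchor, positive) - d(anchor, negative) + margin). The sketch below computes it for plain vectors; the Euclidean distance, the margin of 1.0, and the toy 2-D embeddings are assumptions, and the paper's subspace decomposition is not modelled here.

```python
import math
from typing import Sequence


def triplet_margin_loss(anchor: Sequence[float],
                        positive: Sequence[float],
                        negative: Sequence[float],
                        margin: float = 1.0) -> float:
    """Hinge on the gap between anchor-positive and anchor-negative distance."""
    d_pos = math.dist(anchor, positive)  # Euclidean distance (Python 3.8+)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings: the similar story is near the anchor, the other is far.
anchor_vec = [0.0, 0.0]
similar_vec = [0.0, 1.0]
dissimilar_vec = [3.0, 4.0]

satisfied = triplet_margin_loss(anchor_vec, similar_vec, dissimilar_vec)
violated = triplet_margin_loss(anchor_vec, dissimilar_vec, similar_vec)
```

The loss is zero once the similar story is closer to the anchor than the dissimilar one by at least the margin, so training effort concentrates on triplets that are still ordered wrongly or too tightly.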
GUIR at SemEval-2026 Task 8: Training-Free Multi-Query Fusion for Robust Conversational Retrieval
Pasha Abrishamchian | Ophir Frieder | Nazli Goharian
We describe our SemEval-2026 Task 8 Subtask A system, which focuses on evaluating and improving the retrieval aspect of multi-turn Retrieval-Augmented Generation (RAG) conversations. We implement a training-free fusion approach that combines three distinct query representations to retrieve documents independently. The results from these three views are pooled and reranked using a MonoT5 cross-encoder. Our findings demonstrate that this fusion approach consistently outperforms single-strategy baselines, revealing that optimal retrieval strategies vary significantly at the query level, and establishing multi-query fusion as a baseline for multi-turn RAG systems.
AI4PC-Howard University at SemEval-2026 Task 5: Calibrated Hybrid Ensembling and Retrieval-Augmented LLM Reasoning for Narrative Word-Sense Plausibility
Kwaku Asare | Saurav Aryal
We present two complementary approaches for rating word-sense plausibility in SemEval-2026 Task 5 (literary homonyms in five-sentence stories). Approach 1 is a retrieve-then-generate pipeline using an open-weight Llama 3.1 70B Instruct model with structured reasoning and a self-correction pass. Approach 2 is a hybrid ensemble that combines API-based LLM prompting with transformer representations and a learned calibration layer trained on the development set. On the development set, Approach 2 achieves Spearman ρ = 0.7393 (p < 10⁻¹⁰²) with accuracy 0.8010 (471/588). Approach 1 achieves ρ = 0.5187 (p < 10⁻⁶⁵) with accuracy 0.6032 (561/930). We emphasize that Approach 1 does not exceed RoBERTa-base in accuracy (0.6032 vs. 0.6410), but provides stronger rank correlation.
Howard University-AI4PC at SemEval-2026 Task 7: Culturally Aware Multilingual Model Routing Through a Mixture-of-Specialists Framework
Isaac Adjei | Saurav Aryal
SemEval-2026 Task 7 (BLEnD) evaluates culturally contextual multiple-choice reasoning across 26 languages and 30 geographic regions, emphasizing everyday knowledge, cultural norms, and region-specific variations in language use. This paper presents the Howard University–AI4PC system, a Phase 1 implementation of a culturally aware Mixture-of-Specialists (MoS) framework designed to improve multilingual cultural reasoning without requiring large-scale fine-tuning. Our approach integrates four key components: (1) linguistic and regional metadata extraction for identifying language, dialect, and cultural context; (2) a hierarchical routing strategy that selects the most culturally aligned model path; (3) Model Control Prompting (MCP), which injects region-aware constraints, dialectal hints, and output-format controls; and (4) a lightweight retrieval-augmented layer that supplies culturally specific factual cues. Although specialist LoRA/QLoRA adapters are planned for future phases, the routing and prompting layers alone achieve 80.01% accuracy on 47,014 test MCQs, demonstrating that cultural grounding and linguistically informed routing substantially enhance performance even in the absence of trained experts. We summarize the task, describe the system in detail, present quantitative and qualitative analyses, and outline next-stage extensions involving specialist model training and expanded cultural knowledge integration.
GenAIus at SemEval-2026 Task 8: Beyond Retrieval with Relevance-Aware RAG for Faithful Multi-Turn Generation
Suveyda Yeniterzi | Reyyan Yeniterzi
This paper describes our submission to SemEval-2026 Task 8 on multi-turn retrieval-augmented generation (RAG). We propose a hybrid multi-stage pipeline that combines high-recall lexical retrieval, dual-embedding dense re-ranking with reciprocal rank fusion, LLM-based relevance judging, and strictly constrained evidence-grounded generation. Our design emphasizes robustness and faithfulness across the full retrieval-to-generation pipeline. Our results suggest that relevance-aware filtering and constrained generation are important for improving faithfulness and overall RAG performance.
FregeLogic at SemEval 2026 Task 11: A Hybrid Neuro-Symbolic Architecture for Content-Robust Syllogistic Validity Prediction
Adewale Akinfaderin | Nafi Diallo
We present FregeLogic, a hybrid neuro-symbolic system for SemEval-2026 Task 11 (Subtask 1), which addresses syllogistic validity prediction while reducing content effects on predictions. Our approach combines an ensemble of five LLM classifiers, spanning three open-weights models (Llama 4 Maverick, Llama 4 Scout, and Qwen3-32B) paired with varied prompting strategies, with a Z3 SMT solver that serves as a formal logic tiebreaker. The central hypothesis is that LLM disagreement within the ensemble signals likely content-biased errors, where real-world believability interferes with logical judgment. By deferring to Z3’s structurally-grounded formal verification on these disputed cases, our system achieves 94.3% accuracy with a content effect of 2.85 and a combined score of 41.88 in nested 5-fold cross-validation on the dataset (N = 960). This represents a 2.76-point improvement in combined score over the pure ensemble (39.12), with a 0.9% accuracy gain, driven by a 16% reduction in content effect (3.39→2.85). Adopting structured-output API calls for Z3 extraction reduced failure rates from ∼22% to near zero, and an Aristotelian encoding with existence axioms was validated against task annotations. Our results suggest that targeted neuro-symbolic integration, applying formal methods precisely where ensemble consensus is lowest, can improve the combined accuracy-plus-content-effect metric used by this task.
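The control flow described in the abstract above, where the ensemble decides unless its members disagree and a formal solver then breaks the tie, can be sketched as follows. This is an illustration under stated assumptions: any non-unanimous vote is treated as a disputed case, and the solver is a hypothetical callable standing in for the Z3 check (returning None when formal extraction fails); neither detail is confirmed by the abstract.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def predict_validity(votes: Sequence[bool],
                     solver: Callable[[], Optional[bool]]) -> bool:
    """Use the ensemble majority unless the classifiers disagree."""
    majority = Counter(votes).most_common(1)[0][0]
    if len(set(votes)) == 1:
        return majority            # unanimous: the solver is never called
    verdict = solver()             # disputed case: defer to formal check
    # If the solver produced no verdict, fall back to the majority vote.
    return majority if verdict is None else verdict


# Hypothetical votes from five LLM classifiers on one syllogism.
unanimous_votes = [True, True, True, True, True]
split_votes = [True, True, True, False, False]
```

Calling the solver only on disputed cases keeps the formal pipeline, and its possible extraction failures, away from the examples on which the ensemble already agrees.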
Tübingen-CL at SemEval-2026 Task 12: Reinforcement Learning and Verification for Abductive Reasoning
Bolun Liang | Ayperi Khudaybergenova | Shashikala Kankanamge
We investigate the reliability of verifier-based pipelines for abductive reasoning in SemEval-2026 Task 12. While reinforcement learning improves the base generator’s performance, we find that incorporating a small-model verifier introduces a significant generalization gap: although effective on validation data, the verifier systematically degrades correct predictions on the unseen test set by appending false positives. Furthermore, we reveal a critical vulnerability in the official evaluation metric, which assigns zero reward to abstentions but does not sufficiently penalize incorrect selections. This asymmetry enables trivial heuristic strategies such as blindly selecting a default option to substantially inflate performance, even outperforming more principled reasoning systems. Our analysis demonstrates that current evaluation protocols can misrepresent true reasoning ability and highlights the need for more robust verification methods and scoring schemes.
AI4PC-Howard University at SemEval-2026 Task 12: Evidence-Guided Abductive Scoring with Option-Conditioned Retrieval and Constrained LLM Evaluation
Ifeoluwakiitan Ayandosu | Saurav Aryal
Abductive event reasoning in the wild requires selecting plausible explanations for an event from noisy, partially relevant multi-document context. We present an evidence-guided abductive scoring pipeline for SemEval-2026 Task 12 that separates evidence selection from explanation scoring. For each topic, we chunk documents and retrieve option-conditioned evidence using dense embeddings, then apply a cross-encoder reranker to form compact evidence packs per option. A constrained large language model scorer evaluates each option using only its evidence pack and outputs structured signals capturing evidence support, explanatory directness, and contradiction. We then apply deterministic decision rules to produce single or multi-label predictions, including robust handling of “none of the above” style options through lexical-cue detection rather than reliance on option position. This modular design reduces distraction from irrelevant documents, improves comparability across options, and enables controlled calibration for multi-answer outputs. Our approach demonstrates that retrieval-focused evidence compression combined with disciplined, signal-based scoring can effectively support abductive reasoning without explicit knowledge graphs or end-to-end prompting over full document context.
UPR at SemEval-2026 Task 9: Polarization Detection in Urdu with Language-Specific Transformer and Data Augmentation
Alishba Wazir | Muhammad Asad Khan | Junaid Rashid | Shamaila Hayat | Samira Kanwal
This paper addresses polarization detection in Urdu, a low-resource language characterized by complex morphology and insufficient annotated data. We formulate the task as a binary classification problem of social media posts into polarized and non-polarized categories. Our approach is based on Urdu-BERT, a language-specific transformer model combined with language-specific preprocessing, duplicate removal, and data augmentation to mitigate class imbalance and improve generalization. Experimental results show that the fine-tuned Urdu-BERT outperforms TF-IDF-based lexical machine learning baselines and achieves strong performance relative to multilingual transformer baselines. The findings indicate that language-specific pretrained transformers, when combined with appropriate preprocessing and augmentation strategies, provide an effective and generalizable framework for low-resource Urdu polarization detection.
UPR at SemEval-2026 Task 9: Multi-Label Classification of Polarization Across Social Dimensions and Manifestation Identification in Urdu
Mtayyaba Shahzad | Inzmam Khadam | Zaufishan Mahmood | Junaid Rashid | Shamaila Hayat | Fakhar Ayub
The analysis of polarized content on social networks is crucial for understanding public discourse; however, research on low-resource languages such as Urdu remains limited. In this work, we address two complementary subtasks of polarization analysis in Urdu social media text. First, we formulate polarization classification across multiple social dimensions as a multi-label task, including political, religious, racial/ethnic, gender/sexual, and other. We fine-tune XLM-RoBERTa for multi-label classification with language-specific preprocessing, duplicate filtering, and data augmentation to handle class imbalance. The proposed model achieves a Macro F1-score of 0.758 for social-dimension polarization classification. Second, we perform polarization manifestation identification, focusing on how polarization is expressed in text through six manifestations: stereotype, vilification, dehumanization, extreme language, lack of empathy, and invalidation. Using the same transformer-based framework with imbalance-aware training, our system achieves a Macro F1-score of 0.72 on the official test set. These results demonstrate the effectiveness of multilingual transformer models for multi-dimensional polarization analysis in low-resource Urdu text.
The Classics at SemEval-2026 Task 3: Combining Transformer Models and LLM-Generated Annotations for Dimensional Aspect-Based Sentiment Analysis
Rafif Alshawi | Amit Raj - | Aleksey Kudelya | Alexander Shirnin
This paper presents an approach to the SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis. We investigate methods for moving beyond traditional categorical sentiment (e.g., positive or negative) to predict fine-grained, real-valued scores for sentiment "valence" (positivity) and "arousal" (intensity). We participate in two subtasks: predicting these scores for given aspects (Subtask 1) and extracting full sets of sentiment details, including aspects, categories, and opinions alongside their scores (Subtask 3). Our approach for the regression task involves a weighted ensemble of transformer-based encoder models. For the Russian language, we further enhance the input by using a large language model (LLM) to generate synthetic sentiment descriptions. For the extraction task, we fine-tune a decoder LLM to perform structured prediction, allowing the system to identify sentiment elements and estimate their numerical scores simultaneously.
UTD-HLTRI at SemEval-2026 Task 7: Bridging Cultural Knowledge Gaps in LLMs via Web-Augmented Context
Mohammad Marufur Rahman | Rakshitha Rao Ailneni | Sanda Harabagiu
Though Large Language Models (LLMs) have been serving global users through a wide range of services, concerns remain regarding their cultural bias and misalignment with people of underrepresented communities. Increasing use of LLMs presents significant implications, as they have the potential to influence people’s original values toward a certain cultural perspective. Cultural alignment of LLMs with culture-specific knowledge offers a suitable solution to this concern. In our participation in SemEval-2026 Task 7, we considered a prompt engineering-based cultural alignment strategy to address the cultural knowledge gap in LLMs. Our approach achieved a promising 86.34% accuracy for Japanese culture-relevant multiple-choice questions from the BLEND benchmark.
MoodMetric at SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning
Samanvitha Bolisetty | Shreya Ashar | Nishchay Mittal | Pruthwik Mishra
This paper presents our system for narrative similarity modeling in SemEval Task 4, focusing on transformer-based dense embedding approaches. Modeling similarity between long-form narratives is particularly challenging due to the need to capture event progression, causal structure, character dynamics, and thematic coherence beyond surface-level lexical overlap. We evaluate multiple pretrained encoder-only architectures, including DeBERTa-v3, BGE-Base, BGE-Large, and E5-Large, fine-tuned using triplet margin and contrastive objectives. In addition, we implement a hybrid lexical–semantic baseline combining TF-IDF and SBERT features. Our experiments analyze the impact of model scale, pooling strategies, layer freezing, training duration, and embedding-level ensembling under low-resource conditions (approximately 1,900 training triplets, with additional synthetic augmentation). Results show that larger contrastively pretrained embedding models consistently outperform smaller variants, with BGE-Large achieving the strongest standalone performance. However, performance saturates quickly, and moderate fine-tuning (4–5 epochs) yields optimal validation accuracy, while extended training leads to overfitting. Instruction-tuned embeddings do not demonstrate significant advantages over contrastively aligned alternatives for this task. Finally, arithmetic averaging of embeddings from diverse models produces the most robust representations, achieving approximately 65% validation accuracy.
PLlama at SemEval-2026 Task 4: Zero-shot Prompting with Llama-3.2 for Narrative Similarity
Kanishka Jain
This paper describes our submission to the SemEval-2026 Task 4 on Narrative Story Similarity and Narrative Representation Learning. The shared task focuses on modeling the similarity across narratives on the basis of perceived relatedness between events’ causality. The task frames narrative similarity as a binary classification problem in which the models determine which of the two stories is more narratively similar to a given anchor story. Our approach leverages the pre-trained language model Llama-3.2-3B-Instruct with prompt engineering, allowing the system to assess narrative similarity without explicit fine-tuning. On the test data, our system achieved an accuracy of approximately 55% in Track A. While modest, our results establish a baseline for narrative similarity detection in large language models (LLMs), highlighting both the potential and the challenges of applying computationally efficient instruction-tuned models to this task. Our analysis highlights the struggle of LLMs in capturing event causality and long-range narrative dependencies.
Team HITS at SemEval-2026 Task 4: Enhancing narrative text embedding model training with hard negatives generation and self-distillation
Qian Zhou | Yi Fan | Wei Liu | Michael Strube
We first use the Qwen2.5-32B-Instruct model to generate hard negatives from three narrative dimensions. We then train a Qwen3-Embedding-8B model with a multi-negative contrastive objective and use self-distillation.
LATE-IIMAS at Semeval-2026 Task 13: Evaluating GNNs, PLMs, LLMs, and Stylometry for Automatic Code Identification
Andric Valdez | Emmanuel Ancona | Sebastián Bernardino | Helena Gomez-Adorno | Fazlourrahman Balouchzahi | Fabian Herrera
The generation of source code via Artificial Intelligence has become a prevalent practice in both academia and industry, posing significant challenges to academic integrity and authorship attribution. In this work, we address SemEval-2026 Task 13: Detecting Machine-Generated Code by evaluating the effectiveness of four distinct methodologies: Graph Neural Networks (GNNs), Pre-trained Language Models (PLMs), Large Language Models (LLMs), and Stylometric Feature Engineering using XGBoost. Our approach focuses on three specific scenarios: Subtask A (Binary Detection), Subtask B (Multi-Class Authorship), and Subtask C (Hybrid Code Detection). While our models achieved high performance during the validation phase, the transition to the final test set revealed substantial challenges in generalization, likely due to the increased diversity of programming languages and generators in the unseen data. This work serves as a foundational first step, identifying critical gaps in model robustness and highlighting the need for more sophisticated methodologies to bridge the performance gap in complex, real-world environments.
UAlberta at SemEval-2026 Task 5: Disambiguating Stories via Task Decomposition
David Basil | Junhyeon Cho | Chirooth Girigowda | Guoqing Luo | Sahir Momin | Sevryn Robinson | Ning Shi | Grzegorz Kondrak
We describe our system for predicting sense plausibility in short narratives. Our approach centers on task decomposition: instead of predicting a score directly, we break the problem into simpler subtasks and combine their outputs. We further improve performance by ensembling complementary signals, including word sense disambiguation and fine-tuned embedding models. We also find empirical support for the one-homonym-per-translation principle of Hauer and Kondrak (2020a). Our best ensemble system achieves competitive performance in the official evaluation. Our code and data are available on GitHub.
ChulaNLP at SemEval-2026 Task 4: Neural Aspect Composition for Narrative Story Embeddings
James Gampper | Attapol Rutherford
Comparing stories and narratives has proven to be a difficult task to automate because traditional vector representations fail to capture the layered and multi-faceted aspects of stories such as theme, plot progression, and resolution. We address SemEval-2026 Task 4, which requires generating vector embeddings that preserve narrative similarity relationships. We propose Neural Aspect Composition, which functions by using a Large Language Model (LLM) to decompose stories into 13 semantic narrative aspects (theme, course of action, outcomes, etc.), encodes each aspect separately using an encoder model, and learns a global importance weight for each aspect through a trained weighting layer. Our approach achieves the official test scores of 0.64 on Track A and 0.61 on Track B. During validation, it outperformed vectors produced by inputting the raw story text directly into an encoder model and a sentence-averaging baseline. The analysis of the learned weights on the development set reveals that thematic elements and narrative resolutions were the primary drivers of perceived similarity, receiving significantly higher weights than intermediate plot events and other minor details such as character introductions.
GUNLP at SemEval-2026 Task 10: Psycholinguistic Conspiracy Marker Extraction and Detection (PsyCoMark)
Rojin Ziaei | Mahsa Khoshnoodi | Nazli Goharian
This paper presents the Georgetown University NLP (GUNLP) system developed for SemEval 2026 Task 10: Psycholinguistic Conspiracy Marker Extraction and Detection, addressing the classification of conspiratorial beliefs in Reddit posts (Subtask 2). Our approach leverages COVID-Twitter-BERT v2 (CT-BERT-v2) within a multi-task learning framework that jointly optimizes conspiracy classification and emotion label prediction through a dual-head architecture. To address data scarcity, we enrich the training set using paraphrasing-based data augmentation and GPT-5-generated chain-of-thought emotion annotations, effectively doubling the training corpus to approximately 8,600 examples. We evaluate two input configurations: text only and text concatenated with emotion labels. The emotion-aware configuration achieves the strongest performance with an F1 score of 0.87 on the official development set, outperforming the text-only baseline by five F1 points and demonstrating the value of paraphrased samples and affective auxiliary supervision for conspiracy detection in social media text.
ChulaNLP at SemEval-2026 Task 5: Regression-Calibrated LLM for Word-Sense Scoring
Wayu Limsuwan | Attapol Rutherford
Word Sense Disambiguation (WSD) is typically framed as a classification task that selects one correct sense for a word. However, real language is often less clear-cut, as a homonym may support several plausible interpretations. SemEval 2026 Task 5 addresses this limitation by introducing plausibility rating, where models estimate how likely each sense is in a narrative context, aligning predictions with graded human judgments. We use GlossBERT and BEM as encoder-based baselines and show that large language models (LLMs) produce more accurate plausibility estimates. Building on this observation, we propose a regression-calibrated LLM model that applies linear regression to adjust raw LLM outputs to better match human annotation patterns. Our calibrated model achieves the highest within-standard-deviation accuracy among our evaluated systems, demonstrating that lightweight post-hoc calibration can substantially improve LLM performance on graded semantic judgment tasks.
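The post-hoc calibration idea in this abstract can be sketched as an ordinary least-squares fit from raw LLM scores to human ratings. This is only an illustration under stated assumptions: the function names, the single-feature setup, and the clipping bounds `low` and `high` are hypothetical and are not taken from the paper.

```python
from statistics import mean
from typing import Sequence, Tuple


def fit_linear_calibrator(
    raw: Sequence[float], human: Sequence[float]
) -> Tuple[float, float]:
    """Fit human ~ slope * raw + intercept by ordinary least squares."""
    if len(raw) != len(human) or len(raw) < 2:
        raise ValueError("need two or more paired scores")
    mean_raw, mean_human = mean(raw), mean(human)
    variance = sum((x - mean_raw) ** 2 for x in raw)
    if variance == 0:
        raise ValueError("raw scores must not all be identical")
    covariance = sum((x - mean_raw) * (y - mean_human) for x, y in zip(raw, human))
    slope = covariance / variance
    intercept = mean_human - slope * mean_raw
    return slope, intercept


def calibrate(score: float, slope: float, intercept: float,
              low: float, high: float) -> float:
    """Apply the fitted line and clip to an assumed rating range [low, high]."""
    return min(high, max(low, slope * score + intercept))
```

The calibrator would be fitted on held-out development pairs and then applied to each raw LLM output at prediction time.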
The Argonauts at SemEval 2026 Task 6: Large Language Models for Response Clarity Classification: Prompting, Fine-Tuning, and Data-Centric Approaches
Sajib Bhattacharjee | Sha Newaz Mahmud | Md. Refaj Hossan | Kawsar Ahmed | Mohammed Moshiul Hoque
Detecting equivocation is essential, as indirect or evasive responses can shape public perception, influence political narratives, and undermine transparency in democratic discourse. To address the challenge of detecting evasive political responses on digital platforms, participation in the CLARITY SemEval-2026 Task was undertaken, which focuses on (i) clarity-level classification and (ii) fine-grained evasion-type classification in political question-answer contexts. This study introduces a data-centric framework that systematically examines the effects of class distribution and refinement strategies on the performance of Large Language Models (LLMs). A distribution-aware, LLM-augmented dataset was constructed by selectively paraphrasing minority-class instances to enhance class balance, and its performance was benchmarked against full, rebalanced, and undersampled training configurations. To comprehensively assess the proposed method, Qwen3-14B, Phi-4, Gemma-2 9B, and Mistral 7B were evaluated in in-context learning (ICL) settings (zero-shot and few-shot) and with LoRA fine-tuning. Experimental results indicate that fine-tuning Phi-4 with class rebalancing yields strong performance, achieving 74.77% on Subtask-1 and 51.55% on Subtask-2. Consequently, the system ranked 21st in Subtask-1 and 22nd in Subtask-2 on the official evaluation leaderboard.
IIMAS-RAG at SemEval-2026 Task 8: Hybrid Sparse-Dense Retrieval and Answerability-Conditioned Generation for Multi-Turn RAG
Vania Raya-Rios | Helena Gomez-Adorno | Leon Hecht | Pedro Vázquez-Osorio | Erick Fabián-Sandoval | Jesús Vázquez-Osorio | Diego Hernández-Bustamante
This paper presents IIMAS-RAG, our system for SemEval-2026 Task 8 on evaluating multi-turn retrieval-augmented generation. Our approach combines LLM-based query rewriting, hybrid sparse-dense retrieval with SPLADE and Voyage-3-large fused via Reciprocal Rank Fusion, and answerability-conditioned generation with GPT-4.1. The system ranked 4th out of 38 teams in Subtask A (Retrieval) and 13th out of 29 teams in Subtask C (Full RAG). Our results show that query rewriting is the most impactful retrieval component, while generation remains challenging in low-context and partially answerable scenarios.
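Reciprocal Rank Fusion, which this abstract uses to merge sparse and dense result lists, can be sketched in a few lines. The function name and document identifiers are hypothetical, and the smoothing constant `k = 60` is a commonly used default rather than a value reported by the authors.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    with ranks starting at 1; documents are returned by descending fused score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Ties are broken by document id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Usage: fuse a sparse ranking with a dense ranking.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d2", "d3", "d1"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Because only ranks are used, the fusion does not require the sparse and dense scores to be on comparable scales.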
ServSocIA at Semeval-2026 Task 9: Evaluating Prompt Strategies for Polarization Detection
Jacob Altamirano | Mario Leon Pérez | Bruno Ruiz-Juarez | Luis Chiruzzo | Helena Gomez-Adorno | Fazlourrahman Balouchzahi
This paper presents our approach to Subtask 1 of SemEval-2026 Task 9 on multilingual polarization detection in social media texts in English and Spanish. We model the task as a prompt-based binary classification problem and systematically compare zero-shot, one-shot, and few-shot strategies across multiple large language models accessed via commercial APIs, without task-specific fine-tuning. Our controlled experimental setup enforces strict data separation and consistent decoding conditions to analyze the impact of in-context supervision across architectures and languages. Results indicate that well-structured prompting enables competitive performance, though implicit and culturally nuanced polarization remains challenging.
This paper describes our system for POLAR Subtask 1 on multilingual polarization detection. The task involves binary sequence classification over 22 languages, where the model aims to predict whether a given text exhibits polarized discourse. To deal with the multilingual and resource-imbalanced nature of the dataset, we fine-tune XLM-R, a pre-trained multilingual transformer encoder, using a language-aware sampling strategy that combines all available training data into a unified multilingual corpus. Our system achieves an overall macro-F1 of 0.781 and an average accuracy of 0.823 on the official test set. Results show strong performance in low-resource languages, though some discrepancies indicate remaining class imbalance.
Cherish at SemEval-2026 Task 2: Enhancing RoBERTa-Based Models for Emotional Valence and Arousal Prediction in Ecological Essays with Personalized PLoRA and Temporal Embeddings
Cetta Parahita
This paper describes the system developed by Team Cherish for SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time from Ecological Essays. Our approach models emotional dynamics in user-generated text by incorporating both personalization and temporal information into a transformer-based architecture. We use RoBERTa-large as the backbone encoder and enhance it with PLoRA and a temporal embedding module. Cherish’s model architecture is designed to maintain general semantic knowledge while subtly adapting to individual users and emotional shifts over varying temporal gaps. Our system achieved 13th place out of 29 teams in Subtask 1, obtaining a Pearson’s r composite score of 0.596 for valence prediction and 0.505 for arousal prediction. While the team also participated in Subtask 2a, technical issues during inference led to zero variance in predictions, resulting in an undefined (NaN) official correlation score.
NLP-CEIA-UFG at SemEval-2026 Task 8: Iterative Retrieval with Notes-Guided Query Refinement for Multi-Turn RAG
Guilherme Dutra | André Felipe Caraíba | Nádia Félix Da Silva | Paulo Dos Santos | Deborah Silva Fernandes | Sávio Salvarino De Oliveira
We describe NLP-CEIA-UFG, our system for SemEval-2026 Task 8, which evaluates multi-turn retrieval-augmented generation (RAG) over heterogeneous document corpora. Our pipeline centers on a three-iteration dynamic retrieval loop in which two gpt-oss-120b-powered modules, an Iterative Query Generator and a Notes Builder, interact at each step to diversify queries and accumulate structured notes on information gaps. After the loop, an Answerability Classifier routes the query to one of three generation paths (Complete Answer, Partial Answer, or Clarification Request). Hybrid BM25 and dense retrieval is fused via Reciprocal Rank Fusion and refined by the Jina listwise reranker. The retrieval pipeline is compiled under DSPy and optimized with GEPA. We achieve nDCG@5 of 0.4502 (rank 17/38, Subtask A) and HM = 0.3774 (rank 24/29, Subtask C). Post-hoc analysis identifies an over-conservative Answerability Classifier as the primary bottleneck: 75.5% of all responses were flagged as IDK by the evaluator, including 69.8% of ANSWERABLE questions, while the retrieval and generation components perform well when the classifier routes correctly. Our code is available at https://github.com/GuiiCorreia/SemEval-2026.
Sentiment Syndicate at SemEval-2026 Task 6: Reframing Political Question–Answer Interactions via Natural Language Inference for Clarity Level Classification
Rafi Rafsan
This paper presents the Sentiment Syndicate team’s submission to SemEval-2026 Task 6, Subtask 1 (CLARITY: Unmasking Political Question Evasions), which focuses on classifying the clarity level of political question–answer interactions. We investigate three modeling strategies: (1) fine-tuning a RoBERTa-based classifier, (2) reformulating the task as a Natural Language Inference (NLI) problem, and (3) leveraging large language models (LLMs) for classification. All approaches are evaluated using macro F1-score on the official dataset. Experimental results demonstrate that the NLI based formulation outperforms the other strategies, highlighting the effectiveness of modeling semantic alignment between questions and answers. Our best-performing system achieves an F1-score of 0.67 on the test set.
clulab-retrieval at SemEval-2026 Task 8: A Comparative Analysis of Dense Retrievers and HyDE for Multi-Turn Conversational Retrieval
Hyungji Kim | Siva Rohit Kondapaneni | Steven Bethard
We present a comparative analysis of dense retrievers and retrieval strategies for multi-turn conversational retrieval in SemEval-2026 Task 8 (MTRAGEval). Our official submission employed a fine-tuned E5-based dense retriever (E5-FT, ~110M parameters) with Hypothetical Document Embeddings (HyDE), achieving nDCG@5 of .3309, ranking 31 out of 38 systems. On the development set we also compared E5-FT versus BGE embeddings, dense-only versus hybrid retrieval strategies, and HyDE versus keyword extraction approaches. We found: (1) BGE (general-purpose, ~110M) outperforms our domain-fine-tuned E5-FT (~110M) by 30.5% on baseline retrieval, suggesting that model selection may matter more than domain-specific fine-tuning, (2) hybrid retrieval combining BM25 and dense methods provides complementary signals, with HyDE improving BM25 by 26.7% and dense retrieval by 4.0%, and (3) keyword-based query simplification degrades performance by 11-28% across domains, validating HyDE’s approach of preserving semantic richness through passage-level text.
Narrative Nexus at SemEval-2026 Task 4: Modeling Narrative Similarity via Instruction-Based Fine-Tuning and Synthetic Data Augmentation
Haotan Guo | Hongbin Na | Zimu Wang | Wei Wang
Narrative similarity assessment requires models to reason beyond surface-level lexical overlap and capture higher-level plot structures and thematic relationships. In this paper, we address SemEval-2026 Task 4 Track A: Narrative Story Similarity by reformulating it as an instruction-following generation problem. We employ parameter-efficient fine-tuning via LoRA to adapt pretrained large language models for triplet-based narrative comparison. To overcome the limitations imposed by the scarcity of human-annotated data, we further incorporate synthetic triplet samples generated by a large language model for data augmentation. Experimental results demonstrate that our fine-tuned Qwen2.5-7B model achieves competitive performance, outperforming the zero-shot GPT-4o-mini baseline. These findings underscore the effectiveness of task-specific adaptation combined with synthetic data augmentation for narrative similarity modeling.
ttda704 at SemEval-2026 Task 4: Modeling Narrative Structures via Pseudonymization and Multi-View Sentence Alignment
Tai Tran Tan | An Thien
We present our approach to SemEval 2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. Our solution uses contrastive learning with fine-tuned sentence transformers to capture narrative similarity across abstract themes, course of action, and outcomes. We develop two pipelines: (Track A) a single-view method that encodes full narratives with smart layer freezing to reduce overfitting, and (Track B) a multi-view method that models theme, plot, and outcome with view-specific projection heads and self-supervised alignment. Both pipelines build on sentence-transformers models and are trained with contrastive loss on synthetic data. The code is available at the following GitHub repository: https://github.com/dinhthienan33/SemEval2026-Task4-ttda704.
PingAn-NLP at SemEval-2026 Task 9: Multi-Stage Alignment via GRPO and Tiered Ensemble Voting for Multilingual Polarization Detection
Diyang Chen | Youzhen Pang
This submission describes the PingAn-NLP system for SemEval-2026 Task 9 Subtask 3, identifying polarization manifestations in 18 languages. We employ a tiered optimization framework integrating Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). Key technical innovations include synthetic reasoning distillation from a 235B teacher model, a Smart-Tradeoff reward function designed to mitigate extreme label imbalance, and a tiered ensemble voting strategy that adaptively adjusts decision thresholds based on language resources. Our 8B-GRPO-Vote system demonstrated robust internal performance in tracks like English and Hindi and officially secured second place in the Bengali, English, Odia, and Turkish competitions.
ttda704 at SemEval-2026 Task 6: Structured Chain-of-Thought Prompting for Political Evasion Detection
Tai Tran Tan | An Dinh
We present our system for SemEval-2026 Task 6 (CLARITY: Unmasking Political Question Evasions), which addresses political evasion detection in English question-answer pairs from U.S. presidential interviews. We compare two paradigms: (1) parameter-efficient fine-tuning of Qwen3 models (4B–32B) using QLoRA with tiered upsampling and weighted cross-entropy loss to address severe class imbalance, and (2) structured Chain-of-Thought (CoT) prompting with reasoning-capable API models, including DeepSeek-V3.2 and Grok-4-Fast. Our best system uses Grok-4-Fast with extended reasoning and few-shot hierarchical CoT prompting, achieving Macro F1 scores of 0.5147 on Subtask 2 (9-class evasion) and 0.7979 on Subtask 1 (3-class clarity). On the official leaderboard, it ranks 8/33 on Subtask 2 and 13/41 on Subtask 1. Ablation results show that hierarchical label presentation provides a useful reasoning scaffold and that extended reasoning helps models handle subtle pragmatic distinctions, although the strongest prompt variants are not statistically distinguishable in Macro F1.
Multi-Label Polarization Classification with twHIN-BERT and SCUT Threshold Optimization
Ilinca Vandici | Ådne Jøssing | Lukas Viestädt
Tackling task 2, we fine-tune a BERT-style encoder with classification heads added on top. We first try out different pre-trained encoder models before settling on the TwHIN-BERT multilingual model, since its pretraining corpus (mainly tweets) provides a suitable starting point for our task. To resolve the issue of diverging label annotation styles, we apply the S-Cut algorithm to calibrate thresholds for label selection and examine its impact. We take a look at the resulting hidden representations in a reduced dimensional space, and examine the linguistic information encoded by our model after fine-tuning using linguistic probing.
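The per-label threshold calibration mentioned in this abstract can be sketched as follows. This is a generic reading of S-Cut (choose, for each label independently, the score threshold that maximizes F1 on held-out data), not the authors' code; the function names and the toy scores are hypothetical.

```python
from typing import List, Sequence


def f1_score(pred: Sequence[bool], gold: Sequence[bool]) -> float:
    """Binary F1; returns 0.0 when there are no positives at all."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def scut_thresholds(scores: Sequence[Sequence[float]],
                    labels: Sequence[Sequence[int]]) -> List[float]:
    """Pick one threshold per label by maximizing validation F1.

    scores[i][j] is the model score of label j for example i;
    labels[i][j] is 1 if label j applies to example i, else 0.
    A label is predicted when its score is >= the chosen threshold.
    """
    thresholds = []
    for j in range(len(scores[0])):
        column = [row[j] for row in scores]
        gold = [bool(row[j]) for row in labels]
        best_threshold, best_f1 = 0.5, -1.0
        # Only observed scores need to be tried as candidate thresholds.
        for candidate in sorted(set(column)):
            current = f1_score([s >= candidate for s in column], gold)
            if current > best_f1:
                best_threshold, best_f1 = candidate, current
        thresholds.append(best_threshold)
    return thresholds


# Usage on a toy validation set with two labels.
val_scores = [[0.9, 0.2], [0.6, 0.4], [0.3, 0.8], [0.1, 0.7]]
val_labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
per_label_thresholds = scut_thresholds(val_scores, val_labels)
```

Tuning thresholds per label, instead of using a global 0.5 cut-off, lets each label absorb its own annotation bias.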
DigiS-FBK at SemEval-2026 Task 9: Multi-task Learning for Multilingual and Cross-cultural Polarization Classification
Veronica Orsanigo | Alan Ramponi | Elisa Leonardelli
Online polarization promotes social fragmentation, misinformation, hate, and toxic language. Polarization has been studied from social and communication perspectives, but it can also be addressed computationally as a text classification task. Due to the variety of polarization targets and manifestations, polarization is a complex phenomenon to study, and both detecting and characterizing it are challenging tasks. In this paper, we present the systems submitted by the DigiS-FBK team to SemEval-2026 Task 9 POLAR aimed at detecting polarization in textual content (subtask 1) and identifying its type (subtask 2) and manifestation (subtask 3) in a multilingual, multicultural, and multievent context. Considering the strong link between subtasks, we propose an approach that leverages a multi-task learning paradigm. Our results reveal that, despite the variability in scores across languages, the overall performance when using multi-task learning is higher than when adopting a single-task approach in all subtasks.
CausalMinds at SemEval-2026 Task 12: Simple Fine-Tuning with Option Shuffling Outperforms Complex Pipelines for Abductive Event Reasoning
Vidur Gupta | Xiaofei Zhao | Jason Shaye
We describe our system for SemEval-2026 Task 12 on Abductive Event Reasoning, which requires identifying plausible direct cause(s) of real-world events. We conduct a systematic evaluation of 23 configurations spanning prompting, retrieval-augmented generation, multi-stage verification, and supervised fine-tuning across models of different scales. Across experiments, we found that fine-tuning GPT-4.1-mini with data augmentation via option shuffling consistently outperformed more complex multi-stage pipelines and larger-model prompting strategies. Our system scores 0.88 on the test dataset, ranking 19th out of 221 submissions, which is only 0.07 away from the highest scoring submission of 0.95. Interestingly, chain-of-thought prompting and multi-stage verification hurt performance compared to simpler baselines. This reinforces that simplicity can outperform complex pipelines. We document these negative results and examine the persistent gap between development (0.991) and test (0.88) scores.
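A minimal sketch of what option-shuffling augmentation could look like for a multiple-choice item with one or more correct options. The field names (`event`, `options`, `answers`) and the number of shuffled copies are hypothetical and not taken from the task data or the authors' code.

```python
import random
from typing import Dict, List


def shuffle_options(example: Dict, rng: random.Random) -> Dict:
    """Return a copy of the example with options permuted and answer indices remapped."""
    options = example["options"]
    order = list(range(len(options)))  # order[new_position] = old_position
    rng.shuffle(order)
    new_options = [options[old] for old in order]
    old_to_new = {old: new for new, old in enumerate(order)}
    new_answers = sorted(old_to_new[a] for a in example["answers"])
    return {"event": example["event"], "options": new_options, "answers": new_answers}


def augment(examples: List[Dict], n_copies: int, seed: int = 0) -> List[Dict]:
    """Keep each original example and add n_copies shuffled variants of it."""
    rng = random.Random(seed)  # fixed seed so the augmented set is reproducible
    augmented = []
    for ex in examples:
        augmented.append(ex)
        for _ in range(n_copies):
            augmented.append(shuffle_options(ex, rng))
    return augmented
```

Because only the positions change, the text of the correct options is identical in every variant, which discourages a fine-tuned model from relying on answer position.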
CUETClashing at SemEval-2026 Task 1: Multilingual Joke Generation Under Lexical and Topical Constraints Using Small Instruction-Tuned LLMs
Madiha Ahmed Chowdhury | Lamia Khan | Faozia Fariha | Symom Hossain Shohan | Mohammed Moshiul Hoque
Generating humorous text is one of the most challenging tasks in natural language generation, as models must simultaneously juggle creativity, cultural understanding, and rules. To tackle these issues, this paper introduces our system for Subtask A of SemEval-2026 Task 1: MWAHAHA - Models Write Automatic Humor And Humans Annotate, which asks for single-sentence jokes with two rules—certain words must be included, and the joke must relate to a news headline—in English, Spanish, and Chinese. Our method uses instruction-tuned language models: Qwen2.5-3B-Instruct for English and Chinese, and Salamandra-2B-Instruct for Spanish, paired with language-specific prompts, special sampling for outputs, and a strong cleaning process after jokes are generated. Without additional task-specific training, our system generates jokes that adhere to the rules in all three languages, demonstrating that simple prompt design and small, instruction-tuned models can be a strong, efficient way to generate funny text across multiple languages.
Osint at SemEval-2026 Task 13: A Distribution-Aware Framework for Machine-Generated Code Detection and Multi-Source Authorship Attribution
Shifali Agrahari | Abhishek Anand | Shubham Kannaujiya | Sanasam Ranbir Singh | Sujit Kumar
The rise of code-generating LLMs such as DeepSeek, Qwen, and Meta-LLaMA has improved developer productivity but also increased risks of plagiarism, copyright misuse, and insecure machine-generated code. While AI-text detection is well studied, machine-generated source-code detection, especially across multiple languages, LLM families, and OOD conditions, remains underexplored. SemEval-2026 Task 13 addresses this via two subtasks: (A) binary human–machine code detection and (B) multi-class authorship attribution across ten LLM families. For Subtask A, we fine-tune RoBERTa, CodeBERT, GraphCodeBERT, and StarCoderBase-1B, introducing a stratified sampling strategy with class-weighted loss to mitigate imbalance and OOD shifts. For Subtask B, we mitigate the extreme human-class imbalance using undersampling, inverse-frequency weights, syntactic noising, and curriculum-based dual-path training with TinyStarCoderPy and CodeBERT. Results on both subtasks show that long-context modeling, distribution-aware sampling, and noise-robust training are crucial for reliable detection in real-world settings. Overall, long-context modeling, distribution-aligned sampling, and lightweight noise-robust training emerge as key factors for reliable machine-generated code detection and authorship attribution.
PolaFusion at SemEval-2026 Task 9: Ensemble Transformers with Targeted Augmentation for Multilingual Polarization Detection
Abdullah Mohammad
We present PolaFusion, our system for SemEval-2026 Task 9, which requires detecting polarization in social media posts across 22 languages, classifying its type (Subtask 2), and identifying its rhetorical manifestation (Subtask 3). The task is characterized by severe and pervasive class imbalance across all three subtasks and all 22 languages. We address this through a combination of three strategies: a hierarchical gating architecture where a binary gatekeeper model gates two specialist classifiers trained exclusively on polarized content; an eight-model mega-ensemble combining five-fold mDeBERTa-v3-base and three-fold XLM-RoBERTa-large with soft-vote probability aggregation; and a Macro-F1-aware augmentation strategy using Qwen3-235B that generates synthetic minority-class examples only for language-label pairs that are both scarce and poorly learned. Throughout training, inverse-frequency class weighting within BCEWithLogitsLoss forces the model to attend proportionally to rare labels. Our system achieves official Macro-F1 scores of 0.800, 0.576, and 0.502 on Subtasks 1–3, respectively, outperforming the POLAR baseline by +0.040, +0.089, and +0.082 average Macro-F1 across languages. Our code is publicly available at https://github.com/Abdullah4152/PolaFuse.
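Soft-vote probability aggregation, as named in the abstract above, can be sketched as a (optionally weighted) average of per-label probabilities across ensemble members. The function name, the uniform default weights, and the 0.5 decision threshold are assumptions for illustration; the linked repository is the authoritative source for the actual system.

```python
from typing import List, Optional, Sequence


def soft_vote(
    model_probs: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Average per-label probabilities over models.

    model_probs[m][j] is model m's probability for label j on one input.
    """
    n_models = len(model_probs)
    if weights is None:
        weights = [1.0] * n_models  # uniform vote unless weights are supplied
    total = sum(weights)
    n_labels = len(model_probs[0])
    return [
        sum(w * probs[j] for w, probs in zip(weights, model_probs)) / total
        for j in range(n_labels)
    ]


def decide(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Convert averaged probabilities into 0/1 label decisions."""
    return [int(p >= threshold) for p in probs]
```

Averaging probabilities rather than hard votes lets a confident model outweigh several uncertain ones, which is the usual motivation for soft voting over majority voting.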
NLP-CIMAT at SemEval-2026 Task 9: LLM-Based One-Shot and Cross-Lingual Data Augmentation for Polarization Detection
Miriam Calderon-Reyes | Fernando Sanchez-Vega | Adrian Pastor Lopez Monroy
This paper describes our participation in SemEval 2026 Task 9: Multilingual Text Polarization. The task requires estimating polarization levels across languages, where linguistic variability and limited annotated data pose significant challenges. To address data scarcity, we propose a pipeline that combines cross-lingual translation, synthetic data augmentation via LLMs, and domain-specific pre-trained models. Our approach leverages the hypothesis that polarization signals can transfer across languages without substantial loss of semantic alignment, enabling effective data augmentation through translation. Notably, one-shot synthetic example generation emerges as a viable strategy for enriching training data in topic-specific scenarios. Experimental results demonstrate high stability and competitive performance, achieving a macro F1-score of 0.7869 for Spanish and 0.7939 for English on the test set, ranking 21st on the official English leaderboard, while our Spanish results are competitive with top-performing systems, corresponding to 7th place.
Dawn at SemEval-2026 Task 8: Structured Control Decomposition for Faithful Multi-Turn Retrieval-Augmented Generation
Feiling Li | Xiaoya Qi | Xunyue Wang | Pusheng Chen | Zhiwen Tang | Han Yang
Multi-turn Retrieval-Augmented Generation faces structural challenges that go beyond single-turn retrieval and fusion. Context-dependent queries, cross-turn evidence accumulation, and uncertain answerability jointly affect retrieval quality and generation reliability. We propose a structured control framework that formulates multi-turn RAG as a regulated reasoning process rather than a loosely coupled pipeline. The system first performs evidence and context structuring, extracting atomic facts strictly grounded in reference passages while reconstructing a self-contained query from dialogue history. It then conducts decision-conditioned generation, where explicit control signals regarding question intent, dialogue dependency, and answerability govern response feasibility, scope, and organization. By separating structural decision making from surface realization, the framework enforces consistent information flow across stages and reduces hallucination. Experiments on SemEval-2026 Task 8 show that our approach achieves strong faithfulness and stable overall performance, ranking 17/26 on Task B (generation, H=0.6333).
SYSUpporter at SemEval-2026 Task 13: Leveraging Stylistic Signals and Language-Aware Truncation for Machine-Generated Code Detection
Longfeng Chen | Zheng Xiao
This paper describes our system for SemEval-2026 Task 13 Subtask B, which requires attributing source code to either a human author or one of 10 LLM families. Guided by dataset analysis, we identify three practical challenges: formatting fingerprints discarded by tokenizers, heterogeneous code lengths, and extreme class imbalance. We build on unixcoder-base with Explicit Stylistic Prompting, Language-Aware Truncation, and imbalance-aware training (Focal Loss, GeM pooling, multi-sample dropout, and bucket batching). Our system achieves 0.434 Macro F1 on the official hidden test set, ranking 4th out of 34 teams with only 125M parameters. Controlled 5-fold cross-validation confirms that each component contributes to the final system, and a formatting-normalization study quantifies the model’s reliance on formatting cues.
ssurface3 at SemEval-2026 Task 3: Efficient Methods for Multilingual Dimensional Aspect-Based Sentiment Analysis
Anatolii Frolov | Elisei Rykov
This paper describes our submission to the dimABSA Shared Task (Subtask 1), which requires predicting continuous Valence and Arousal scores for target aspects in multilingual reviews. We evaluate three approaches: prompting-based baselines, a multilingual encoder model, and a decoder-only LLM with supervised fine-tuning. Our main focus is efficient adaptation under multilingual data scarcity. We show that compact encoder and decoder models, when properly fine-tuned, achieve strong performance across languages and domains. To improve training stability and enforce valid predictions, we use a bounded regression formulation that maps outputs to the target score range. We also explore parameter-efficient fine-tuning and intermediate training on external affective data. Results show that prompting-based baselines are substantially weaker than supervised models. The multilingual encoder provides a strong and efficient baseline, while the best performance is achieved by a compact decoder model with parameter-efficient fine-tuning. Overall, our findings highlight the importance of careful fine-tuning and training design for multilingual dimensional sentiment analysis.
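One common way to realize a bounded regression formulation like the one described above is to squash an unbounded model output with a sigmoid and rescale it to the score range. The sketch below assumes that construction and a hypothetical range of 1 to 9; the abstract states neither the exact mapping nor the range.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bounded_output(raw: float, low: float, high: float) -> float:
    """Map an unbounded regression output into the closed interval [low, high]."""
    return low + (high - low) * sigmoid(raw)


def predict_valence_arousal(raw_v: float, raw_a: float, low: float = 1.0, high: float = 9.0):
    """Apply the same bounding to both heads (the range here is an assumed example)."""
    return bounded_output(raw_v, low, high), bounded_output(raw_a, low, high)
```

With this mapping, every prediction is valid by construction, so no clipping is needed after inference and extreme raw outputs cannot produce out-of-range scores.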
The Argonauts at SemEval-2026 Task 9: Multilingual Polarization Detection and Classification Using LLM Prompting and Transformer Fine-Tuning
Sha Newaz Mahmud | Sajib Bhattacharjee | Md. Refaj Hossan | Kawsar Ahmed | Mohammed Moshiul Hoque
Online polarization, defined as the pronounced division of public opinion into antagonistic groups, poses a significant threat to social cohesion. Automatic detection of polarization across diverse languages and cultures is essential for effective monitoring of online discourse. The challenge extends beyond identifying hate speech to recognizing more nuanced forms, including negative stereotypes, attribution of blame, and dehumanization. This work addresses SemEval-2026 Task 9, which focuses on detecting polarization in multiple languages. Specifically, Subtask 1 involves binary classification of message polarization, while Subtask 2 requires assigning multiple polarization labels in English and Bengali. For Subtask 1, Qwen3-14B is employed with structured few-shot prompting in 4-bit mode, yielding test macro-F1 scores of 0.847 for Bengali (4th place) and 0.808 for English (9th place). For Subtask 2, XLM-RoBERTa-large and RoBERTa-base are fine-tuned using an uneven loss (γ+ = 1, γ− = 4) and label-specific thresholds, which increase development macro F1 by up to 24.6 points. The final test macro F1 for English is 0.454 (21st place). Analysis indicates that large language model prompting enhances binary polarization detection, while threshold adjustment is critical for addressing class imbalance in multi-label tasks.
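The loss with separate focusing parameters γ+ and γ− can be illustrated with a per-label asymmetric focal-style term. This sketch assumes that the "uneven loss" in the abstract is of this general family; the exact formula used by the authors (for example, any probability margin) is not given there.

```python
import math


def asymmetric_loss(
    prob: float,
    target: int,
    gamma_pos: float = 1.0,
    gamma_neg: float = 4.0,
    eps: float = 1e-8,
) -> float:
    """Focal-style binary loss with different focusing exponents for positives and negatives.

    Positives: -(1 - p)^gamma_pos * log(p)
    Negatives: -(p)^gamma_neg * log(1 - p)
    """
    prob = min(max(prob, eps), 1.0 - eps)  # keep log() finite
    if target == 1:
        return -((1.0 - prob) ** gamma_pos) * math.log(prob)
    return -(prob ** gamma_neg) * math.log(1.0 - prob)


def multilabel_loss(probs, targets, gamma_pos: float = 1.0, gamma_neg: float = 4.0) -> float:
    """Sum the per-label terms for one multi-label example."""
    return sum(asymmetric_loss(p, y, gamma_pos, gamma_neg) for p, y in zip(probs, targets))
```

A larger exponent on the negative side shrinks the contribution of easy negatives, which dominate in imbalanced multi-label data, while positives keep a comparatively strong gradient.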
TFB at SemEval-2026 Task 4: Diagnosing Model Failures in Narrative Understanding
Anna Colli | Benedictus Kent Rachmat | Eve Sauvage | Delphine Battistelli | Thomas Gerald | Cyril Grouin | Julien Tourille | Zheng Zhang
We describe the participation of team TFB in SemEval-2026 Task 4 on narrative similarity. We explore ColBERT-inspired sentence-level late interaction to capture event reordering, compare fine-tuning with synthetic data at multiple difficulty tiers (finding that distribution proximity to the target data matters more than volume), and evaluate chain-of-thought prompting. We complement these approaches with a human annotation study (Krippendorff’s alpha=0.32) confirming the task’s inherent difficulty, and with an analysis of synthetic data distribution shift that explains why fine-tuning on out-of-distribution data hurts the model’s performance. Despite these experiments, we did not surpass the results of sentence-t5-xxl on Track B and Qwen2.5-7B on Track A, so we submitted these two models for the task.
DeltaSHAP: a Shapley Value Framework for Interpreting Political Ambiguity
Sven-Alexander Gal | Rodica-Ioana Lung
Political ambiguity and response clarity have become increasingly important research topics in computational social science and natural language processing. In this paper, we present a solution to the SemEval 2026 Task 6 "Clarity" Challenge. We propose a novel framework that employs TF–IDF representations and Shapley-value–based feature selection for multi-class classification. Shapley-based feature importances are used both for post-hoc explanation and as an active mechanism for label-specific vocabulary selection. For each label, features exceeding a predefined threshold are retained, label-specific vocabularies are filtered through set differences, and independent one-versus-all classifiers are trained on the resulting label-specific features. Experimental results show that threshold tuning substantially impacts performance, with the best performance achieved at intermediate threshold values. Our findings demonstrate that game-theoretic feature selection provides an interpretable approach to clarity classification, offering a flexible methodology for ambiguity-sensitive text analysis.
INFOTEC-NLP at SemEval-2026 Task 9: Comparing Regional Transformers and Bag-of-Words Approaches for Polarization Detection in Spanish
Eduardo C. C. Hernandez-Garcia | Guillermo Ruiz | Mario Graff
Polarization detection in short texts is a challenging and relevant problem in Natural Language Processing, particularly in social media environments where regional variations and subtle discursive nuances converge. In this paper, we describe our participation in Subtask 1 (Spanish) of SemEval-2026 Task 9 (Naseem et al., 2026a), which focuses on binary polarization classification. We evaluate two main strategies: lexical models based on Bag-of-Words representations and regionally pre-trained Transformer models for Spanish. In addition, we explore a logistic stacking framework that combines lexical and contextual representations. Our experiments show that regionally adapted Transformers generally outperform purely lexical approaches, with BILMALAT achieving the strongest performance in this task. The results highlight the importance of regionally aligned pre-training on social media data for effective polarization detection in Spanish.
Aatman at SemEval-2026 Task 9: Transfer Learning for Multilingual Polarization Detection
Aatman Vaidya
This paper describes our system for Subtask 1 of SemEval-2026 Task 9: POLAR, which focuses on multilingual polarization detection. The task is formulated as a binary classification problem across 22 languages drawn from diverse online platforms and real-world events. We investigate three complementary approaches: supervised fine-tuning of multilingual encoder-only transformer models, zero- and few-shot classification using large language models (LLMs), and transfer learning from related harmful language tasks such as hate speech, toxicity, abusive language, and gender-based violence. Among the supervised models, mDeBERTa achieved the strongest baseline performance. Prompt-based methods with open-weight LLMs showed limited effectiveness, particularly in zero-shot settings. The best results were obtained using transfer learning, where the model was first fine-tuned on related task datasets and then adapted to the polarization task, achieving a Macro-F1 score of 0.81. Our findings indicate that supervised multilingual encoders remain highly effective for polarization detection and that incorporating related harmful language tasks can substantially improve performance, especially for nuanced and context-dependent expressions of polarization.
ZYC at SemEval-2026 Task 5: Application of BERT-based Contextual Embeddings Similarity for WSD
Sunny Zhou | Jordan Youner | Dean Cahill
We investigate contextual embedding manipulation for Word Sense Disambiguation (WSD) as part of SemEval-2026 Task 5. We propose four approaches built on BERT-like pretrained models, experimenting with the informativeness of similarity calculations and classification methods. We introduce scratch-trained cross-attention mechanisms inspired by GLiNER to compute similarity between definition or synonym representations and the full context. Our best system achieved 57% accuracy with a Spearman correlation of 0.20. Our results suggest that fine-tuning strategy and training curriculum matter more than pretrained model choice for this novel task, and we identify several directions for future improvement. View our code base at: https://github.com/heliosraz/SemEval52026
MINDS at SemEval-2026-Task 1: Enhancing Humor Generation through RAG and Synthetic DPO Alignment
Sina Eskandari | Seyed Amirreza Mousavi | Amirreza Rahimi | Mona Pouresmaeil | Marcello Vitaggio | Claudio Savelli | Riccardo Coppola | Flavio Giobergia
Humor generation presents significant challenges due to subjectivity and the limitations of automatic metrics. In this work, we address Task 1 of SemEval 2026 (Subtask A) by evaluating three instruction-tuned models (Llama 3.1, Gemma 2, and Qwen 2.5) via a round-robin LLM judging framework. We investigate the impact of Retrieval-Augmented Generation and Direct Preference Optimization (DPO) on performance. Our results identify Llama 3.1 as the strongest baseline and demonstrate that DPO consistently improves humor quality across configurations. These findings confirm the efficacy of LLM-based judging as a practical training signal for optimizing subjective generation tasks.
uva-irlab-conv at SemEval-2026 Task 8: Multi-Turn RAG with Learned Sparse Retrieval and Listwise Reranking
Simon Lupart | Kidist Mekonnen | Zahra Abbasiantaeb | Mohammad Aliannejadi
This report describes our participation in SemEval-2026 Task 8 on multi-turn retrieval and question answering. The task evaluates conversational systems across four domains (finance, cloud documentation, government, Wikipedia), and includes unanswerable queries where the available collection does not contain sufficient evidence to produce a complete response. We propose a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM-based reranking and generation. Using sparse retrieval as the primary retrieval method, we leverage its strong generalization across domains. In addition, we make use of the long-context capabilities of LLMs for conversational query rewriting, pointwise and listwise reranking, and generating the final response, each conditioned on the full conversational history. This multi-step design enables effective integration of conversational context throughout retrieval and generation, improving robustness across domains.
Team CV at SemEval-2026 Task 4: Prompting LLMs and Benchmarking Embedding Models for Narrative Story Similarity
Chandan Kumar R S | Vinay Ulli
This paper describes Team CV’s systems for SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning (Hatzel et al., 2026). For Track A (comparative judgment), we explore five prompting strategies (zero-shot, chain-of-thought, structured feature extraction, pairwise scoring, and few-shot) and QLoRA fine-tuning of smaller models. For Track B (narrative embeddings), we benchmark twelve dedicated text embedding models of varying dimensionality (384–4096) spanning open-source (E5-Large-v2, BGE, GTE, Qwen3 Embedding) and closed-source (OpenAI, Gemini, Mistral) families, and fine-tune Qwen3 Embedding 4B on task-specific triples. Few-shot prompting with Qwen-2.5 7B (64.00%) outperforms all fine-tuned variants (best 57.50%) on Track A; scaling to LLaMA-3.3-70B yields 75.00%. On Track B, OpenAI text-embedding-3-large (3072-d) achieves the best dev accuracy (67.00%), while fine-tuning Qwen3 Embedding 4B (2560-d) on synthetic triples slightly decreases accuracy. Our final submission, LLaMA-3.3-70B (3-shot) for Track A and text-embedding-3-large for Track B, achieves 70.75% and 64.50%, exceeding the GPT-4o-mini and STORY-EMB baselines respectively.
DANGNT@SGU at SemEval-2026 Task 1: A Two-Stage Mistral Generator with DistilBERT Reranking for English Humor Generation
Tan Loc Nguyen | Dang Tuan Nguyen
We describe DANGNT@SGU’s system for the English track of SemEval-2026 Task 1 (MWAHAHA), Subtask A (text-based humor generation). Our pipeline combines a two-stage QLoRA-adapted generator based on mistralai/Mistral-7B-Instruct-v0.2 with a DistilBERT reranker trained to distinguish jokes from non-jokes. The generator is first adapted on a raw joke corpus for general humor style, then further tuned on synthetic task-format instruction–response pairs for Word Inclusion and News Headline prompts. At inference time, we generate five candidates per input, optionally enforce lexical constraints for Word Inclusion prompts, and rerank candidates with the classifier. In the official English Subtask A results, our team DANGNT@SGU obtained Elo 962 (95% CI: 926–986), ranking 13th. The system is practical, reproducible, and based entirely on open models and public data.
LingoResearchGroup at SemEval-2026 Task 9: Evaluating Prompt Variants for Polarization Detection
Pritam Kadasi | Anuj Tiwari | Mayank Singh
Our submission to SemEval-2026 Task 9: Multilingual Text Classification Challenge - Polarization Detection covers all three subtasks: (1) binary polarization detection, (2) polarization type classification, and (3) polarization manifestation identification. We systematically study short prompts, comparing twelve designed prompts that differ in terminology clarity, level of definitional detail, reasoning guidance, and use of in-context examples. The experiments are conducted using aya-101 and Gemma3-27B, with the latter chosen for the final submission based on development performance. Our system achieves an average macro F1-score of 0.762 on Subtask 1, 0.587 on Subtask 2, and 0.444 on Subtask 3, with average accuracies of 0.819, 0.678, and 0.498, respectively, on the official test set averaged over 22 languages. Through cross-task and cross-lingual analysis, we demonstrate that prompt-based approaches can effectively detect coarse-grained polarization but encounter increasing difficulty with fine-grained and multi-label sociolinguistic classification.
ABARUAH at SemEval-2026 Task 9: Multilingual Polarization Detection across Seven Indic Languages using Qwen3
Arup Baruah
Online polarization creates division within society. As such, it is important to detect and remove polarized messages from social media. This study presents models based on the fine-tuned Qwen3-8B Large Language Model (LLM) to identify online polarization, its specific categories, and its manifestation types. This study used Quantized Low-Rank Adaptation (QLoRA) to fine-tune the model in seven Indic languages: Bengali, Hindi, Nepali, Oriya, Punjabi, Telugu, and Urdu. The experimental results demonstrate the efficacy of this approach, achieving macro F1-scores of 0.82, 0.78, 0.90, 0.76, 0.78, 0.87, and 0.79, respectively, for polarization detection. The proposed model surpassed the established baseline systems in several of the subtasks, suggesting that parameter-efficient fine-tuning is a viable and powerful strategy for addressing linguistic diversity in low-resource and high-variability Indic language datasets. To leverage cross-lingual transfer, a unified model was developed by fine-tuning on a concatenated dataset of seven Indic languages. This approach proved superior to standalone language-specific models, yielding substantial improvements in F1-score (most notably a 28.76 point gain in Subtask 2 for Punjabi). This provides strong evidence for the benefits of cross-lingual knowledge transfer in low-resource settings.
DUTIR at SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning
Tala Borjigin | Liang Yang
This paper presents our approach for SemEval 2026 Task 4. Our method leverages a large language model fine-tuned via Low-Rank Adaptation, incorporates data cleaning, and employs a multi-prompt strategy, all trained on the official synthetic dataset. Evaluated on Track A, our system achieved an official score of 0.70, representing a reasonable performance under the given task constraints. In addition, we explore an alternative contrastive learning framework originally designed for Track B, where narrative-structure embeddings are learned and subsequently applied to Track A via similarity comparisons. Our analysis suggests that direct supervised adaptation may be more suitable for narrative reasoning tasks.
j10official at SemEval-2026 Task 1: Neurosymbolic Humor Generation via GTVH-Guided LLM Decomposition
Jatin Agrawal | Radhika Mamidi
We present a neurosymbolic pipeline for computational humor generation grounded in the General Theory of Verbal Humor. The system constructs the joke in five sequential stages: context analysis, humor architecture (identifying core incongruity), delivery strategy, content writing, and pairwise judging, orchestrated through the DSPy framework. The system generates four candidate jokes per input with independent humor strategies, then selects the best through knockout tournament-style evaluation. Despite using Gemma 3 27B, a model with roughly 20× fewer total parameters than frontier systems, our approach achieves competitive results across all five subtasks of SemEval-2026 Task 1 (MWAHAHA), placing 2nd in two subtasks. We argue that these results demonstrate the viability of structured, theory-driven decomposition for solving complex tasks and that how a model reasons about humor is just as important as how large the model is.
BertKittens at SemEval-2026 Task 3: Multi-Domain Aspect Sentiment with BERT/DeBERTa Ensembles for VA Regression and Aspect–Opinion–VA Triplets
Arseny Sukhodolsky | Ruslan Salimgareev | Tatiana Ianshina
Our system is built on transformer encoders (BERT and DeBERTa) fine-tuned in a multi-task learning framework. For the regression subtask (evaluated with RMSE), we jointly predict Valence–Arousal (VA) scores and token-level opinion spans using a shared encoder with task-specific output heads. This formulation introduces auxiliary supervision at the token level, which stabilizes training and improves regression accuracy compared to single-task optimization. When gold abstracts and opinion annotations are provided, our models achieve strong performance. However, in fully end-to-end settings requiring automatic span extraction, performance degrades substantially due to error propagation from token-level predictions. Our findings highlight the benefits of joint affective regression and span modeling, while exposing the limitations of transformer-based sequence labeling under strict end-to-end evaluation constraints.
NarSiL at SemEval-2026 Task 4: A Multi-Expert, Multi-Pathway System for Narrative Story Similarity
Bogdan Octavian Grecu | Costin Chiru | Oana Cocarascu
We present NarSiL (Narrative Similarity Learners), our system for SemEval-2026 Task 4 Track A on Narrative Story Similarity. NarSiL employs a two-stage architecture: a Mixture-of-Experts (MoE) initial classifier that also leverages supermajority voting across three large language models (Gemma-3-12B, GPT-3.5-turbo-instruct, and Gemini-2.5-Flash) over multiple runs, followed by a structured three-pathway fallback for ambiguous cases. The three pathways correspond directly to the task’s three core similarity components: abstract theme, narrative outcome, and course of action. Each pathway yields a similarity score for its respective component, and the scores are then combined through a weighted aggregation step. NarSiL achieves 64.25% accuracy on the official test set. An improved score of 70.25% is obtained by considering only the supermajority voting of GPT, followed by the previously described fallback.
Sagarmatha at SemEval-2026 Task 9: Heterogeneous Ensembling and Hierarchical Task Conditioning for Multilingual Latent Distributional Divergence Modeling
Sujal Maharjan | Astha Shrestha | Pratikshya Shrestha
The digital public square is increasingly fragmented by affective polarization, requiring computational systems capable of identifying discursive strategies such as dehumanization and vilification. This paper presents Sagarmatha, the system developed for SemEval-2026 Task 9. We propose a heterogeneous ensemble architecture that addresses the limitations of standard transformer fine-tuning across 22 languages. Our approach integrates mDeBERTa-v3, ReMBERT, LaBSE, mmBERT, and XLM-RoBERTa through two primary architectural pillars: learnable weighted layer pooling and hierarchical task conditioning. While our final submission (a broad ensemble, R3) demonstrated high stability on the leaderboard, our primary architectural configuration (Weighted Polyglot, R1) yielded superior performance in complex multi-label tasks. The system ranked 1st globally in English and Hausa manifestation identification, and 1st in Telugu detection (2nd in categorization). All code and resources are available at https://github.com/SUJAL390/SagarmathaatSemevaltask9.git.
Archaeology at SemEval-2026 Task 13: Fine-Tuning Pre-Trained Code Models for AI-Generated Code Detection
Jany-Gabriel Ispas | Sergiu Nisioi
This paper describes the system submitted by team Archaeology to SemEval-2026 Task 13 on AI-generated code detection. The shared task consists of three subtasks; we participate in Subtask-A (binary classification: human-written vs. AI-generated code) and Subtask-B (11-class attribution of the generating model). Starting from a TF-IDF and Logistic Regression baseline, we fine-tune four pre-trained code models (CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5+) with separate strategies for each subtask. For Subtask-A, we use leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration on a difficult dataset. For Subtask-B, we use sandwich token packing, class-balanced loss, and multi-seed ensembling with test-time augmentation. Our best submissions obtain macro-F1 scores of 0.737 on Subtask-A and 0.422 on Subtask-B.
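As an illustration of the chunked inference with trimmed-mean aggregation named in the abstract above, the following is a minimal sketch, not the authors' implementation. The function names, the 20% trim fraction, and the 0.5 decision threshold are assumptions made for the example; the abstract does not state the values actually used.

```python
def trimmed_mean(values, trim_fraction=0.2):
    """Mean of values after dropping the lowest and highest trim_fraction share.

    trim_fraction is a hypothetical setting; it must lie in [0, 0.5).
    """
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    cut = int(len(ordered) * trim_fraction)  # items removed from each end
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)


def classify_snippet(chunk_scores, threshold=0.5, trim_fraction=0.2):
    """Aggregate per-chunk P(AI-generated) scores into one binary label.

    The threshold is a placeholder; in practice it would be calibrated
    on held-out data.
    """
    return int(trimmed_mean(chunk_scores, trim_fraction) >= threshold)


# One outlier chunk (0.10) is trimmed away before averaging.
chunk_scores = [0.10, 0.80, 0.85, 0.90, 0.99]
aggregated = trimmed_mean(chunk_scores)
label = classify_snippet(chunk_scores)
```

Trimming both tails makes the snippet-level score less sensitive to a single atypical chunk than a plain mean would be.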
B B at SemEval-2026 Task 6: A RoBERTa-based Model with NLI-derived Semantic Features for Clarity-Level Classification of Political Question Evasion
Chi-Bo Lin | Boyang Yu
Equivocation and ambiguity are common in political interviews, where public figures often avoid directly answering challenging questions. We present our submission to SemEval-2026 Task 6, Subtask 1 on English political response clarity classification. Our system builds on RoBERTa and incorporates NLI-derived semantic features to distinguish Clear Reply, Ambivalent, and Clear Non-Reply responses. To address class imbalance and performance instability, we explore class weighting, multi-seed ensembling, and a hierarchical two-stage framework with threshold tuning. Our best model achieves 60% macro-F1 on the official test set and 64% macro-F1 on an additional evaluation set, demonstrating stable performance across splits. Our results show that carefully engineered smaller models, combined with structured semantic features and imbalance-aware training, provide an effective and computationally efficient solution under limited training data.
BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection
Atharva Gupta | Dhruv Kumar | Yash Sinha
The POLAR SemEval-2026 Shared Task aims to detect online polarization and focuses on the classification and identification of multilingual, multicultural, and multi-event polarization. Accurate computational detection of online polarization is challenging due to nuanced rhetoric, implicit framing, and the high cost of human-in-the-loop annotation. Building on recent findings that contextual prompting enables large language models to function as strong polarization detectors, we present a two-stage approach for detecting polarization in social media text that combines structured supervised fine-tuning with Direct Preference Optimization (DPO) refinement. We fine-tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template (target, claim type, manifestation checklist, and justification). We then apply DPO with automatically generated preference pairs to reduce costly false negatives. Our submitted system achieves 0.7664 Macro-F1 on the English test set. Post-submission experiments with Mistral-Nemo-Instruct-2407 and LLM-judge-filtered preference pairs further improve this to 0.8162 Macro-F1 (not submitted to CodaBench), surpassing the organiser baseline of 0.7802. Code is released publicly.
MINDS at SemEval-2026-Task 13: Robust Detection of Machine-Generated Code under Distribution Shift
Giorgia Rosalia Buccelli | Antonella Coviello | Alexandra Elena Holota | Marco Scaglione | Simone Scalora | Claudio Savelli | Riccardo Coppola | Flavio Giobergia
The growing use of large language models for code generation makes distinguishing machine-generated code from human-written code increasingly difficult, especially under distribution shifts in language, domain, and generator family. SemEval-2026 Task 13 targets this challenge through three subtasks: binary detection, multi-class authorship attribution, and hybrid/adversarial code detection. In this paper, we conduct an empirical study across all subtasks, comparing a variety of approaches: frozen encoder representations, feature-based classifiers, fine-tuned transformer models, post-hoc calibration, and probability-level ensembling. Our results show a consistent generalisation gap: strong in-domain validation scores substantially overestimate performance on shifted test conditions. The code is available at https://github.com/AlexandraElena-Holota/SemEval-2026-Task13.git.
JCT at SemEval-2026 Task 4: A Multi-Method Approach to Narrative Story Similarity
Dvori Rosenfeld | Rinat Walles | Chaya Liebeskind
Narrative similarity detection involves understanding the underlying structure of a story rather than just matching similar words or phrases. This paper details our multi-strategy approach to the SemEval-2026 Task on Narrative Similarity, which requires identifying which of two candidate stories most closely resembles an anchor story based on three dimensions: abstract themes, the sequence of events, and the final outcomes. We developed three distinct but complementary methods to address this challenge. First, we fine-tuned a specialized story-embedding model using parameter-efficient techniques on synthetic data. Second, we utilized a "Distill-then-Embed" workflow, where a large language model extracts the essential narrative core of each story before computing similarity. Third, we employed direct zero-shot prompting to allow a high-reasoning model to compare the stories organically. Our analysis reveals that each approach excels at different types of narrative comparisons, and their combination leads to robust performance. We demonstrate the importance of narrative distillation in removing surface-level distractors and the effectiveness of carefully engineered prompts in guiding language models to focus on narrative structure.
Tifin India at SemEval-2026 Task 5: Semantic Bridge: Augmented Encoding for Word Sense Plausibility
Pawan Rajpoot
We present a hybrid system for SemEval 2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Stories. Our approach reframes LLMs as feature generators rather than direct predictors. We combine two subsystems: one that appends LLM-generated hints to the input context and trains an encoder-based regression model, and another that feeds structured hints from multiple LLM configurations into a lightweight regression ensemble. We generate multilingual enrichments to probe LLMs for complementary signals and take advantage of the fact that translation into certain languages implicitly disambiguates word senses, making the encoder more robust. The 50/50 ensemble achieves 859/930 (92.37%) accuracy with Spearman ρ = 0.8384 on the test set, exceeding the estimated human ceiling of 89.2%. The same LLM enrichments, processed through fundamentally different paradigms (tabular regression vs. full-text encoding), produce complementary errors that cancel under ensembling. Notably, simple 50/50 averaging captures this gain without any learned combiner, suggesting that
GigitAI at SemEval-2026 Task 8: Hybrid Sparse-Dense Retrieval and Zero-Shot Generation for Multi-Turn Conversational RAG
Saran Krishnasamy | Inez Wihardjo
We describe our system for SemEval-2026 Task 8 (MTRAGEval) on multi-turn conversational RAG. Our approach combines hybrid retrieval (fusing SPLADE-v3 learned sparse representations with dense embeddings via Reciprocal Rank Fusion) with a fine-tuned cross-encoder reranker and zero-shot LLM generation using Claude Opus 4.5. We systematically evaluate 56 retrieval configurations across 4 domains, and 5 generation strategies across 5 LLMs. Our findings show that: (1) SPLADE-v3 with dataset rewrites substantially outperforms BM25 across all configurations, (2) simple zero-shot prompting matches sophisticated strategies like Self-RAG and CRAG, and (3) performance varies significantly by answerability class. On the test set, we achieve rank 5/29 on Task C (end-to-end RAG, H=0.5564), rank 7/26 on Task B (generation, H=0.7495), and rank 13/38 on Task A (retrieval, nDCG@5=0.4782). Our analysis reveals strong performance on answerable queries (H=0.685) but degradation on underspecified queries (H=0.254).
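The Reciprocal Rank Fusion step mentioned in the abstract above can be sketched as follows. This is a generic illustration of the technique, not the authors' code: the function name, the document ids, and the smoothing constant k = 60 (a commonly used default) are assumptions for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (ranks start at 1); k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a sparse and a dense retriever.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Because the fusion uses only ranks, it needs no calibration between the sparse and dense scoring scales.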
GheGheGhe at SemEval-2026 Task 11: Decoupling Logic from Belief with Bias-Targeted Fine-Tuning and Neuro-Symbolic Syllogistic Reasoning
Razvan Gogu | Stefan Placintescu | Sofia Vultur
This paper presents a multi-paradigm approach to the first two subtasks of SemEval-2026 Task 11. For the first subtask, we explore two complementary strategies: a Llama-3 8B PEFT Majority Vote Ensemble, trained with bias-targeted augmented data, and a hybrid approach that separates LLM processing from logical reasoning, converting sentences into canonical logical forms for deterministic analysis. The hybrid approach is further extended to the second subtask. Official results placed us 17th in the first subtask and 15th in the second. Post-evaluation analysis indicates that our best model achieved perfect accuracy on the first subtask and revealed several errors in the ground truth data. After identifying certain implementation issues in the second subtask approach, the F1 retrieval score increased to over 98%, which would place us within the top 5 on the leaderboard.
contestant001 at SemEval-2026 Task 13: Stylometric and TF-IDF-Based Detection of Machine-Generated Code
Bora Ozaylar
Reliable detection of machine-generated code has become increasingly important for academic integrity and software quality as code generation is largely being undertaken by large language models. This paper presents our approach to SemEval-2026 Task 13, Subtask A: detecting machine-generated code in a binary classification setting, where we propose an ensemble approach combining TF-IDF lexical representations with 23 hand-crafted stylometric features for binary classification of AI-generated code. Our system aims to address the challenge of cross-language generalization by extracting language-agnostic patterns and combining them with TF-IDF. While we observed that transformer-based models (CodeBERT, UniXcoder) noticeably underperformed under distribution shift, our analysis revealed that AI-generated code exhibits distinct stylometric patterns and our TF-IDF ensemble achieved 0.5175 Macro F1 on the task submission.
VerbaNexAI at SemEval-2026 Task 4: Two-Stage Narrative Similarity via Fine-Tuned Bi-Encoder with MLP Ensemble
Pablo Pertuz-Duran | Edwin Puertas | Juan Carlos Martinez Santos | Jairo Serrano
This paper describes VerbaNex AI’s participation in SemEval-2026 Task 4: Narrative Similarity, a shared task on assessing semantic relatedness between short narrative texts. The task comprises two tracks: Track A requires selecting which of two candidate stories is more similar to an anchor, and Track B requires producing fixed-size story embeddings whose cosine similarity reflects narrative relatedness. We propose a unified two-stage system built on Qwen3-Embedding-0.6B. The first stage fine-tunes the encoder as a bi-encoder with a 512-dimensional projection head using a composite loss combining margin ranking, pairwise softmax, and multiple negatives ranking objectives. The second stage trains a lightweight MLP head over frozen bi-encoder embeddings using pairwise interaction features, with k-fold cross-validation and logit-averaging ensemble inference. The system was trained exclusively on the official supervised data, without leveraging the additional 1,900 LLM-generated synthetic triples released by the organizers. Although the system ranked first on both tracks in the development phase, its performance did not transfer to the official test set, where it ranked 47th on Track A and 22nd on Track B.
CultRAG at SemEval-2026 Task 7: Hybrid Sparse-Dense Retrieval with Entity-Centric Knowledge Bases for Cultural MCQ Answering
Aditya Singh | Rickarya Das
We developed CultRAG, a trust-weighted Retrieval-Augmented Generation system for BLEnD Track 2 (SemEval-2026 Task 7), targeting culturally grounded multiple-choice QA across 30 countries. Built on Llama-3.1-8B-Instruct, the six-phase pipeline integrates entity extraction via spaCy, hybrid BM25+FAISS retrieval with Reciprocal Rank Fusion, country-aware filtering, keyword-based intent detection, tiered prompt routing, anti-leak quality filtering to suppress answer-anchoring artifacts, and trust-weighted document reranking with source-credibility tiers. Ablation analysis across eight cumulative configurations and per-country decomposition identify which components contribute and where retrieval helps versus hurts, informing future directions for confidence-conditioned selective retrieval.
uircis at SemEval-2026 Task 8: A Unified Lightweight Pipeline for Multi-Turn RAG Evaluation
Jiaqi Zhang | Wenbin Duan | Yingqi Zhang | Yan Li | Binyang Li
We submit a system description paper for SemEval-2026 Task 8 (MTRAGEval), covering both Subtask A (retrieval) and Subtask B (generation). Our approach is a lightweight, fully reproducible multi-turn RAG pipeline using open-weight models: Qwen2.5-7B-Instruct for query rewriting and grounded answer generation, BGE-M3 for dense retrieval, and BGE-Reranker-v2-M3 for cross-encoder reranking. We report official test performance, conduct ablation experiments to quantify the impact of rewriting and reranking across domains, and provide error analysis using the organizers’ analytics and answerability classes, highlighting key failure modes in multi-turn retrieval specificity and grounded generation.
AKCIT-UFG at SemEval-2026 Task 8: Structured Chunking and Optimized Query Reformulation for Efficient Multi-Turn Retrieval
David Ferreira | Wilson Ramos | Priscila Ribeiro | Emanuel Passinato | Diogo Fernandes | Arlindo Filho
This submission investigates efficient multi-turn retrieval under constrained computational settings. We analyze how passage granularity and conversational query rewriting affect retrieval effectiveness across four benchmark domains. Using compact, locally deployable components, we show that smaller passage segmentation improves early-rank performance and that lightweight keyword-oriented query reformulation substantially enhances dense retrieval quality. Importantly, we observe that rewriting interacts differently with encoder backbones: some compact models benefit significantly from increased query specificity, while others degrade, indicating sensitivity to rewrite-induced distribution shifts. Our findings demonstrate that competitive multi-turn retrieval does not require large proprietary models, but can emerge from principled structural and preprocessing design choices. The results highlight the importance of aligning chunking strategy, rewriting policy, and encoder characteristics in resource-efficient MT-RAG systems.
INF-rsrs at SemEval-2026 Task 1: Is the best really better? The limits of creative work in the era of LLMs
Guilherme Bazzo | Eduardo Faé | Júlia Junqueira | Higor Moreira | Lucas Rafael Costella Pessutto
Generating humor is a complex and challenging task for Large Language Models (LLMs), requiring both linguistic creativity and strict adherence to constraints. This paper presents INF-rsrs, our solution for SemEval 2026 Task 1: Humor Generation, which tasks models with creating jokes from headlines and word pairs without labeled data. We propose a two-stage framework: a production stage and a selection stage. The production stage employs diverse model families and hyperparameter configurations to generate a wide range of candidate jokes, with each candidate generated by an LLM prompted in the role of a comedian under structured constraints to ensure relevance and humor. Our system was designed to substantiate our claim that the direct use of LLMs in creative works, such as humor generation, hits a hard ceiling that is inescapable through simple prompting. Our proposed system tied for first place in the task ranking, obtaining a top-tier performance.
CodeDet-NITS at SemEval-2026 Task 13: AI Code Authorship Detection Beyond Truncation
Lekkala Sai Teja | Annepaka Yadagiri | Kshitij Patiyal | Sangam Sai Anish | Partha Pakray
Automatically determining whether source code is human-written or produced by a specific family of large language models is becoming essential for reliable assessment, provenance tracking, and dataset curation. We present a lightweight yet competitive system for SemEval 2026 Task 13 Subtask B, which requires attributing each snippet to one of eleven classes: human or one of ten LLM families. Our method repurposes code-oriented, instruction-tuned backbones from the Qwen2.5 Coder series as sequence classifiers and adapts them using QLoRA, combining frozen low-precision weights with low-rank trainable adapters to reduce memory and compute overhead. The core design choice addresses long snippets without losing evidence. Instead of truncating to a fixed context, we apply an overlapping sliding-window strategy that expands long examples into multiple fixed-length windows during training, all sharing the same label. For validation and test, windows are generated on the fly and their evidence is aggregated by averaging logits to yield a single prediction per snippet, enabling token-complete use of the input while keeping inference stable. Our final submission ranked 8th on the official Subtask B test set leaderboard.
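The overlapping-window and logit-averaging idea described in the abstract above can be sketched in plain Python as follows. This is an assumption-laden illustration rather than the authors' pipeline: the function names, the window and stride sizes, and the example logits are hypothetical, and a real system would obtain the per-window logits from the fine-tuned classifier.

```python
def sliding_windows(tokens, window=512, stride=256):
    """Split a token list into overlapping fixed-length windows.

    window and stride are hypothetical values. A final window aligned to
    the end of the sequence is added so that no trailing tokens are dropped.
    """
    if len(tokens) <= window:
        return [tokens]
    starts = list(range(0, len(tokens) - window + 1, stride))
    if starts[-1] + window < len(tokens):
        starts.append(len(tokens) - window)
    return [tokens[s:s + window] for s in starts]


def average_logits(window_logits):
    """Average per-class logits over all windows of one snippet."""
    n_windows = len(window_logits)
    return [sum(column) / n_windows for column in zip(*window_logits)]


def predict_class(window_logits):
    """Return the index of the class with the highest averaged logit."""
    mean_logits = average_logits(window_logits)
    return max(range(len(mean_logits)), key=mean_logits.__getitem__)


# Toy example: 11 tokens, windows of 4 tokens with stride 2.
windows = sliding_windows(list(range(11)), window=4, stride=2)

# Made-up logits for three windows over two classes.
example_logits = [[1.0, 3.0], [3.0, 1.0], [2.0, 5.0]]
prediction = predict_class(example_logits)
```

Averaging logits rather than taking a vote over windows lets a confident window outweigh several uncertain ones.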
NIT-Agartala-NLP-Team at SemEval-2026 Task 9: A Weighted Soft-Voting Ensemble Framework of Fine-Tuned LLMs for Binary and Multi-Label Polarization Detection
Shivam | Manish Kumar | Anupam Jamatia
This paper presents the NIT-Agartala-NLP-Team’s submission to SemEval-2026 Task 9 on polarization detection in textual data. The task comprises two subtasks: (i) binary classification to distinguish polarized from nonpolarized content, and (ii) multi-label classification to identify the specific type(s) of polarization. We propose a weighted soft-voting ensemble framework that integrates multiple fine-tuned large language models (LLMs). The probabilistic outputs of the individual models are combined using weighted averaging to effectively leverage their complementary strengths and enhance overall performance. Our system achieved a test macro F1-score of 78.6 (26th out of 44 teams) in Subtask 1 and 46.0 (18th out of 29 teams) in Subtask 2.
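Weighted soft voting, as described in the abstract above, amounts to a weighted average of the class probabilities produced by each model. The sketch below shows the idea under stated assumptions: the function name, the three sets of probabilities, and the weights are invented for illustration and are not taken from the paper.

```python
def weighted_soft_vote(model_probs, weights):
    """Combine per-model class probabilities by weighted averaging.

    model_probs: one probability list per model, all of equal length.
    weights: one non-negative weight per model (hypothetical values here);
    they are normalised, so they do not need to sum to 1.
    """
    if len(model_probs) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(model_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, model_probs)) / total
        for c in range(n_classes)
    ]


# Three hypothetical models scoring one text as [non-polarized, polarized].
model_probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
weights = [0.5, 0.3, 0.2]
combined = weighted_soft_vote(model_probs, weights)
predicted = max(range(len(combined)), key=combined.__getitem__)
```

In a multi-label setting the same averaging would be applied independently to each label's probability before thresholding.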
uir-cis-7 at SemEval-2026 Task 7: Zero-Shot Chain-of-Thought Reasoning for Cross-Cultural Daily Knowledge
Jianning Gao | Xianling Mao | Shumin Shi | Duanzhi Zhaxi | Yingbo Sun | Xiandeng Li | Binyang Li
SemEval-2026 Task 7 evaluates the ability of Large Language Models (LLMs) to reason about diverse daily knowledge across 30 geographic regions. In this paper, team uir-cis-7 approaches this challenge not merely as an accuracy optimization problem, but as a diagnostic probe to evaluate the representational limits of LLMs without fine-tuning. To address Western-centric bias and the "overthinking penalty" frequently observed in high-resource contexts, we introduce a Two-Tier Dynamic Routing framework. Based on cultural resource density, queries are routed either to a direct-answer pathway or a complex reasoning pathway. The complex pathway utilizes an Anti-Bias Persona-Conditioned Chain-of-Thought enhanced with Knowledge Anchoring and multi-path Self-Consistency voting to mitigate majority-culture heuristics. Evaluated using a strict macro-average metric, our system achieved an overall accuracy of 89.02% on the official leaderboard. Our fine-grained evaluation and theoretical error analysis quantify the epistemological boundaries of prompt-based alignment, proving our dynamic strategy effectively rescues marginalized cultural knowledge while exposing persistent instances where safety-aligned models project Western progressive norms onto traditional contexts. Furthermore, cross-model validation on open-source architectures explicitly confirms our framework’s generalizability.
HHU-SyLo at SemEval-2026 Task 11: Logic in the Loop – Hybridizing LLMs and Theorem Provers for Robust Formal Reasoning
Wiebke Petersen | Cherine Jaziri | Diem Tran
We present our system for SemEval-2026 Task 11 on reasoning disentanglement, separating syllogistic validity from semantic plausibility. We compare direct neural inference against two neuro-symbolic pipelines: translation to first-order logic and to syllogistic triples. By offloading inference to symbolic theorem provers, these hybrid models effectively mitigate content bias and improve logical fidelity.
UMUSP at SemEval-2026 Task 9: Mitigating Cross-Lingual Interference via Selective Multilingual and Multitask Specialization
Julio Cesar Fuganti | Tulio Ferreira Leite Da Silva | Adelino Gala | Francisco S. Marcondes | José Machado | Paulo Novais
This paper proposes a selective multilingual and multitask fine-tuning strategy for online polarization detection that improves cross-lingual stability over fully joint training. Covering all three subtasks — polarization detection (POLARDETECT), polarization type classification (POLARTYPE), and rhetorical manifestation identification (POLARMANIFEST) — across all 22 languages of the shared task, the approach introduces controlled specialization, where languages and subtasks are grouped empirically and separate specialist models are fine-tuned for each subset. Restricting parameter sharing substantially improves performance even without ensemble averaging, whereas ensembling jointly trained models fails to mitigate instability. The final specialist ensemble improves Task 3 macro-F1 from 0.3330 to 0.4920 and reduces cross-lingual dispersion (CV: 0.613 → 0.321). Under the official ranking framework, the system ranks 7th among 16 submissions with complete multilingual and multitask coverage and remains within 5% of the best system in 37.70% of evaluation conditions.
ASTraNet at SemEval-2026 Task 13: Not All Code Looks the Same: Multi-View Structural and Semantic Detection of Machine-Generated Code
Ruwad Naswan | Dipit Saha | Md. Kabir | Nabiha Tahseen
The growing adoption of large language models for code generation poses challenges for code quality, security, and authorship verification—particularly when test conditions involve unseen programming languages, generators, or application domains. We present our system, which combines three code-pretrained transformer encoders (CodeT5p-220M, CodeBERT, UniXcoder) with a structure-first Flow-Augmented AST (FA-AST) encoder implemented as a Gated Graph Neural Network. On Subtask A our best single model achieves macro F1 of 0.559; a post-competition layered rank-fusion ensemble across all three encoders raises this to 0.643. On Subtask C we obtain 0.585 officially; a three-stage ensemble combining neural probabilities with LightGBM-based features and class-priority routing raises this to 0.652. Our contributions include a language-agnostic structural detector, a diversity-driven rank-fusion strategy exploiting low inter-model correlation for binary classification, and a meta-learner stacking pipeline for multi-class detection under distribution shift.
RPI Team at SemEval-2026 Task 3: An LLM-Encoder Ensemble for Coarse-to-Fine Valence-Arousal Sentiment Prediction
Mohammed Shahid Modi | Boleslaw Szymanski
We present our coarse-to-fine Valence-Arousal (VA) ensemble system for Subtask 1 of Task 3 (DimABSA), which covers aspect-level VA prediction. We use a pair of trained Qwen 3 8B LoRA-tuned LLMs to predict coarse bins between 1 and 8, providing ordinal VA guidance signals along with distributional features. We then train an instruction-style, multilingual E5 encoder model with a multitask head using these LLM-derived guidance features to produce continuous VA predictions. At inference time, the same guidance signals are generated for the test set by the trained LLMs and fed into the trained encoder. This approach leverages the LLM as a high-level prior while relying on the encoder for precise calibration across languages and domains. Our system achieves an RMSE_VA of 1.20 across six languages and five domains. We compare the joint VA model to separated valence and arousal models trained on coarsened ground truth data, showing that it outperforms them, particularly on arousal correlations.
CLRG at SemEval-2026 Task 3: One Size Does Not Fit All: A Resource Adaptive Framework for Dimensional Sentiment Regression
Wardat Iqbal | Ruwad Naswan | Swakkhar Shatabda
Predicting continuous Valence and Arousal scores across diverse languages poses significant challenges due to typological differences and the difficulty of modeling affective intensity. We introduce AdaptStance, a parameter-efficient framework designed for the SemEval-2026 Task 3 benchmark. To address cross-lingual disparities, AdaptStance routes inputs through resource-specific pipelines: direct regression with a hybrid concordance loss for high-resource languages, and an auxiliary multi-task mechanism to stabilize regression in low-resource and non-Western contexts. Architectural analysis reveals that decoupling task heads benefits morphologically related languages, whereas joint representations act as crucial regularizers for distant language families. Ultimately, this lightweight approach achieves competitive performance over generative baselines, demonstrating the efficacy of targeted architectural alignment while identifying Valence as the primary bottleneck in continuous affect prediction. Our code is available on GitHub.
PolarMind at SemEval-2026 Task 9: Leveraging LaBSE with Progressive Curriculum Learning for Multicultural Polarization
Sandeep s | Mothish M | Sachin Sundar
Detecting online polarization remains a critical challenge, particularly in multilingual and multicultural contexts where intergroup hostility is prevalent. The problem is particularly challenging due to the scarcity of data for these tasks in low-resource languages. Identifying such phenomena has become an active area of research and is addressed in SemEval 2026 Task 9: Multilingual, Multicultural Online Polarization Detection. To address this problem, we propose an architecture that leverages LaBSE embeddings, an unconventional choice typically reserved for retrieval tasks, to obtain strong cross-lingual learning, which improves scores in low-resource languages by up to 0.2 macro F1. Furthermore, we provide a comprehensive ablation study evaluating the performance of diverse encoder models in the Qwen model family within a retrieval-based prompting framework.
CredenceAI at SemEval-2026 Task 10: A Span-Consistency Network with Cross-Marker Attention for Conspiracy Marker Extraction
Ishaan Karan
We present a Span-Consistency Network (SCN) for conspiracy marker extraction in English social media text. The task requires identifying character-level spans for five marker types (Actor, Action, Effect, Evidence, and Victim) under overlap-based Macro F1 evaluation. Standard token-level classifiers often produce fragmented spans, ignore inter-marker dependencies, and struggle with severe class imbalance. Our approach addresses these challenges through three components. First, a Span Consistency Layer (SCL) propagates span-level confidence signals to encourage coherent boundary formation. Second, Cross-Marker Attention (CMA) models co-occurrence patterns between marker types via a learned correlation matrix. Third, we introduce Span Count Regularization (SCR), a total-variation-based constraint that aligns soft token probabilities with the expected number of discrete spans, mitigating prediction collapse under threshold decoding. Built on DeBERTa-v3-large and trained with a recall-biased Tversky loss, our system is ensembled across five stratified folds. It achieved a Macro F1 of 0.24 on the official test set, placing second among participating teams. Ablation studies show that SCR plays a critical role in maintaining span structure, particularly for low-frequency and long-span markers.
Models Without Borders at SemEval-2026 Task 7: Bridging Cultural Contexts with Search-Grounded QA
Swetha Krishna Sriram | Nirupama Sekar
We present our submission to SemEval-2026 Task 7, focusing on the MCQ track, where models must identify culturally specific answers across language-region locales. Our system augments a compact open-source model with locale-targeted web retrieval at inference time, requiring no task-specific fine-tuning, and places 10th on the leaderboard. Beyond the submitted system, we explore how retrieval depth and search localization affect performance across locales, finding that localizing search parameters meaningfully shifts the geographic composition of retrieved sources and that gains from retrieval are most pronounced for lower-resource locales. We also investigate whether culturally informed prompt framing can complement retrieval, finding that it does, but only when grounding context is present. Taken together, our results point to inference-time web grounding as a practical path toward more culturally aware NLP under resource constraints.
KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection
Archie Sage | Salvatore Greco
This paper describes the KCLarity team’s participation in CLARITY, a shared task at SemEval 2026 on classifying ambiguity and evasion techniques in political discourse. We investigate two modelling formulations: (i) directly predicting the clarity label, and (ii) predicting the evasion label and deriving clarity through the task taxonomy hierarchy. We further explore several auxiliary training variants and evaluate decoder-only models in a zero-shot setting under the evasion-first formulation. Overall, the two formulations yield comparable performance. Among encoder-based models, RoBERTa-large achieves the strongest results on the public test set, while zero-shot GPT-5.2 generalises better on the hidden evaluation set.
ICI Innolabs at SemEval-2026 Task 13: Sliding Windows Meet Code Transformers
Sebastian Balmus | Bogdan Dura
We describe our system for SemEval-2026 Task 13, Subtask B, which focuses on multi-class authorship attribution for code: given a code snippet, the goal is to predict whether it is human-written or generated by one of ten LLM families. The task presents two central challenges: severe class imbalance and long input sequences that frequently exceed the context length of encoder-based Transformers. To address these issues, we adopt a window-based fine-tuning and inference framework. During training, we randomly sample 512-token windows from each snippet and optimize a class-weighted cross-entropy objective with label smoothing. At inference time, we apply a sliding-window strategy and aggregate window-level logits to obtain a snippet-level prediction. We fine-tune three pretrained code encoders (CodeBERT, UniXcoder, and StarEncoder) under this framework and combine their outputs via majority voting. On the official validation split, our best single model (StarEncoder) achieves 0.60 macro F1. On the final test set, the three-model ensemble reaches 0.41 macro F1, ranking 10th on the leaderboard. Our results demonstrate that window-based modeling combined with imbalance-aware optimization provides a robust and reproducible baseline for multi-class LLM attribution under distribution shift.
K-NLPers at SemEval-2026 Task 7: Multiple LLM Agent Debate System for Everyday Knowledge Across Diverse Languages and Cultures
Jiwoo Song | Sihyeong Yeom | Harksoo Kim
This paper presents the K-NLPers system for SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures. The task extends the BLEnD benchmark to evaluate cultural understanding of language models across more than 30 language-country pairs. Although Large Language Models (LLMs) achieve strong overall performance, they exhibit performance disparities across cultural contexts and tend to produce regionally biased responses. To address this limitation, we propose a continent-based multi-agent debate framework that leverages culture-specific performance differences instead of relying on a single model. For the Short Answer Question (SAQ) track, we employ three agents: a general-purpose model, a continent-specific model, and a country-level or culturally adjacent model. These agents engage in independent generation, mutual refinement, and final adjudication. For the Multiple-Choice Question (MCQ) track, we adopt a debate structure centered on high-performing general-purpose models due to the track’s simpler structure. Our system participated in all language-region pairs and achieved overall scores of 55.75 on SAQ and 88.32 on MCQ. Further analysis reveals that grouping the performance of various individual models by continent explains performance patterns more consistently than language-based grouping, highlighting the importance of cultural and historical context in model generalization.
ShefFriday at SemEval-2026 Task 9: LLM-Based Annotation Methods for Detecting Multilingual, Multicultural and Multievent Online Polarisation
Owen Cook | Meredith Gibbons | Xingyi Song
This paper presents our findings for SemEval-2026 Task 9. We submit to all three subtasks using an LLM-as-an-annotator strategy, simulating the data annotation process with large language models. We created 30 LLM annotators using persona injection (also known as sociodemographic prompting) and experimented with various annotation aggregation methods, including Dawid-Skene and MACE. To further increase the variability in annotator responses, we used the hatefulness detection task as a proxy for identifying polarisation. Our findings indicate that this reframing of the problem is effective for the binary classification of polarisation, but is less effective for finer-grained polarisation detection. For subtasks 2 and 3, majority voting yielded the best overall performance. While our unsupervised approach does not rank as highly as supervised ones, this work provides insight into the utility of persona-based prompting and the issue of LLM annotators exhibiting high intra-model agreement.
REGLAT at SemEval-2026 Task 12: Multi-Strategy Ensemble Reasoning for Event Causality Identification
Mariam Francies | Nsrin Ashraf | Ahmed Fetouh | Asad Khalil | Hamada Nayel
This paper describes the multi-strategy ensemble approach used to develop the model submitted to the Abductive Event Reasoning shared task. The proposed model combines semantic similarity, causal pattern recognition, and Large Language Models (LLMs) to identify causal relationships between news events and their causes. Our system achieved competitive performance by integrating semantic embedding-based similarity, explicit causal pattern matching, keyword overlap analysis, temporal alignment scoring, and LLM-enhanced reasoning. Our system achieved accuracies of 65.4% and 43.2% on the development set using the LLM-enhanced configuration and the non-LLM ensemble, respectively. The final score using the test set on the leaderboard is 0.3.
NASIMLab at SemEval-2026 Task 9: A Comparative Analysis of Fine-Tuned Small Language Models vs. Generative Large Language Models for Multilingual Polarization Type Detection
Neel Sabhahit | Sanjeevan Selvaganapathy | Mehwish Nasim
The POLAR dataset contains various social media texts that might be polarized (conflict-inducing or dangerously divisive). The task at hand is to identify whether any of the following types of polarization are present: political, racial/ethnic, religious, gender/sexual, and other types across 22 languages. In this paper, we propose a system of fine-tuned language-specific small language models and compare our approach with state-of-the-art large language models on the POLAR dataset. By fine-tuning models for each language, we demonstrate that fine-tuned small encoder-only models consistently outperform large language models, especially for low-resource languages. Our system performs well on this task for most low-resource languages, notably taking the top spot on the leaderboard in Burmese (mya), appearing within the top 10 for 12 languages, and within the top 20 for all remaining languages.
COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives
Azwad Anjum Islam | Tisa Islam Erana
We present a system for SemEval-2026 Task 5 that predicts 1–5 plausibility ratings for candidate senses of homonyms in ambiguous short stories using prompting with closed-source LLMs. We evaluate three prompting strategies: zero-shot, chain-of-thought, and comparative prompting that jointly scores competing senses. We also find that simple unweighted ensembling aligns better with subjective human judgments than individual models do. Our official submission ranked 4th on the leaderboard with an average score of 0.86, with post-competition experiments improving performance to 0.89.
uir-cis at SemEval-2026 Task 12: Mitigating Prior-Induced Hallucinations in Retrieval-Augmented Reasoning via Precision-Oriented Decoding
Chiyao Zhou | Zebing Wang | Kexin Deng | Yaru Zhao | Lin Deng | Binyang Li
This paper describes our system for the SemEval-2026 Task 12 on Abductive Event Reasoning (AER). We systematically address the "over-selection" hallucination pathology in Instruction-tuned Large Language Models (LLMs), where models erroneously align distractors with semantic priors rather than retrieved evidence. Our framework utilizes a 32-billion parameter Qwen2.5 foundational model adapted via Low-Rank Adaptation (LoRA) and evaluated under a Zero-shot Chain-of-Thought (CoT) setting. To mitigate epistemic noise, we propose a Precision-Oriented Decoding (POD) strategy that couples low-temperature sampling (T=0.45) with scaled majority voting (K=9). Following a three-stage empirical evolution—from baseline diagnosis to precision optimization and ensemble analysis—our system achieved a score of 0.802 on the official test set. Our findings demonstrate that in causal reasoning tasks with strict penalization for incorrect predictions, epistemic noise suppression is strictly superior to heuristic recall compensation.
RAGthoven at SemEval-2026 Task 1: A Multi-Stage Pipeline Walks Into a Benchmark and Barely Clears the Bar
Marek Suppa | Viktória Ondrejová | Lucia Ganajová | Gregor Karetka | Daniel Skala
We present RAGthoven, our system for SemEval-2026 Task 1 (MuWaHaHa), Subtask A (multilingual constrained humor generation in English, Spanish, and Chinese). RAGthoven decomposes creative text generation into a multi-stage large language model (LLM) pipeline (Planner, Writer, Reflector, Judge) grounded in computational humor theories (Benign Violation Theory, Script-based Semantic Theory of Humor) and iteratively refined through prompt engineering across ten experiments. In our final configuration, we augment the Planner with retrieval-augmented generation (RAG) from a curated joke corpus, seeding generation with diverse joke mechanisms. We additionally explore an agentic variant that exposes the same four pipeline stages as tool-calling agents orchestrated by a model loop with a ConstraintAudit checker. While it achieves full constraint compliance, human pairwise evaluation did not reveal a significant quality advantage over the simpler non-agentic baseline. RAGthoven achieves Rank 1 in all three languages, with the strongest result in Spanish (Elo 1182, 42 points above the Gemini 2.5 Flash baseline). However, while the system leads in raw Elo in Spanish, it shares Rank 1 with the baseline in all three languages due to overlapping confidence intervals; in English and Chinese the gap narrows further, suggesting that elaborate multi-stage prompt engineering may offer diminishing returns once a strong frontier model is in the loop.
AKCIT at SemEval-2026 Task 13: A Lightweight LightGBM Baseline for Cross-Language Detection of LLM-Generated Code
Rone Brandao Filho | Walcy Santos Rezende Rios | Lucas Neves | Jose Ricardo Fleury Oliveira | Diogo Fernandes | Arlindo Galvão Filho
The widespread use of LLMs in software development has made the detection of machine-generated code a pressing challenge, particularly when models must generalize across programming languages and domains. We present a lightweight, LLM-free pipeline that combines stylometric feature extraction with a LightGBM classifier and explicitly prioritizes structural generalization over deep semantic modeling. Despite its simplicity, the method achieves a Macro F1 of 0.70–0.72, more than doubling the CodeBERT baseline (0.30) in SemEval-2026 Task 13 Subtask A, while operating without GPUs or any fine-tuning.
UFAL-CUNI at SemEval-2026 Task 11: An Efficient Modular Neuro-symbolic Method for Syllogistic Reasoning
Ivan Kartac | Kristyna Onderkova | Jan Bronec | Zdeněk Kasner | Mateusz Lango | Ondrej Dusek
This paper describes our system submitted to SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models. We present an efficient modular neuro-symbolic approach, combining a symbolic prover with small reasoning LLMs (4B parameters). The system consists of an LLM-based parser that translates natural language syllogisms to a first-order logic (FOL) representation, an automated theorem prover, and two optional modules: machine translation for multilingual inputs and a symbolic retrieval component for the identification of relevant premises. The system achieves competitive accuracy and relatively low content effect on most subtasks. Our ablations show that this approach outperforms LLM-based zero-shot baselines in this parameter size range, but also reveal limited multilingual capabilities of small LLMs. Finally, we include a discussion of the task’s main ranking metric and analyze its limitations.
SyntaxMind at SemEval-2026 Task 6: Exploring Transformers and LLMs for Unmasking Political Question Evasions
Md. Shihab Uddin Riad
This paper describes our approach to Subtask 1: Clarity-level Classification in SemEval-2026 Task 6. The task focuses on determining the clarity of political responses with respect to their corresponding questions. To improve model performance, we introduced a direct answer generation strategy as an additional input feature and applied Task-Adaptive Pre-Training (TAPT) to adapt encoder-only Transformer models to the task domain. We further explored both cross-entropy and focal loss to address potential class imbalance. Experimental results show that TAPT-enhanced encoder models, particularly DeBERTa-V3-base, achieved the strongest performance, while generative small language models fine-tuned via parameter-efficient methods exhibited comparatively lower results. Our system obtained a macro-F1 score of 0.72 on the official evaluation set, ranking 24th out of 40 teams.
IIITH Boys at SemEval-2026 Task 4: StoryNet - Understanding Narrative Story Similarity through Symbolic Representations
Amol Vijayachandran | Ananth Rajesh | Siddharth Mago | Maitreya Chitale | Aparajitha Allamraju
Narrative similarity extends beyond standard semantic tasks, requiring alignment of temporal, causal, and emotional structures. We present StoryNet, a framework that represents stories as heterogeneous graphs with character, event, and theme nodes. Stories are decomposed into structured narrative facets using large language models, and similarity is evaluated through both weighted semantic facet comparison and a graph neural network trained with contrastive learning. We analyze how integrating symbolic structure with learned graph representations compares to purely embedding-based baselines.
Yam at SemEval-2026 Task 4: Failure-Driven Prompt Evolution for Narrative Comparison
Yen Yee Yam | Hong Meng Yam
We present a structured, parameter-free system for SemEval-2026 Task 4 on Narrative Story Similarity. Instead of treating similarity as scalar embedding proximity, we align model reasoning with the task ontology by decomposing each story into abstract theme, course of action, and outcome, and performing contrastive comparison over these dimensions. Our primary contribution is a closed-loop, failure-driven prompt optimization procedure that iteratively refines concise guideline documents while keeping model parameters fixed and reverting updates that degrade performance, thereby isolating improvements attributable to structured reasoning rather than representation learning. Ontology-aligned decomposition alone achieves 70% accuracy on both train and test sets; with controlled guideline evolution, performance improves to 76% on train and 73% on test without additional supervision or fine-tuning. These results demonstrate that structured prompt optimization can meaningfully enhance contrastive narrative reasoning in a fully parameter-free setting.
Pinetree at SemEval-2026 Task 7: A Large-Scale Failure Analysis of Cultural Grounding in Language Models
Yen Yee Yam | Hong Meng Yam
Using a simple prompting strategy without fine-tuning or retrieval augmentation, our system achieved 88.85% micro-average and 90.55% macro-average accuracy, ranking #4 overall on SemEval-2026 Task 7. Our primary contribution is a failure analysis of 5,241 incorrect predictions (11.15% of the dataset), categorized using the six-topic BLEnD taxonomy. Errors concentrate in Food (39.42%) and Holidays/Celebration/Leisure (15.76%), but within-topic error rates are highest on Family (21.04%) and Work life (20.45%), which are topics with limited representational density. Global-brand attractor errors account for only 2.50% of failures and are tightly localized: 98.5% fall on a single template (most popular sport team) in four low-resource cultures. Outside these templates, brand-default effects are statistically negligible. These findings support representational sparsity and knowledge-density asymmetry, not ideological skew, as the dominant cause of cultural misalignment in everyday behavioral tasks.
TUCNLP at SemEval-2026 Task 11: Neuro-Symbolic Content Stripping for Debiased Syllogistic Reasoning
Rafael Butas | Alex Lapusan | Camelia Lemnaru | Rodica Potolea
In this paper, we present the solution submitted by TUCNLP at SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models. The task requires predicting the formal validity of categorical syllogisms while minimizing susceptibility to content-driven biases in English and 11 additional languages. We show that a modestly-sized model (Qwen3-8B) can achieve near-perfect logical reasoning on the English validity-only subtask, and large reductions in content effect on multilingual and premise-retrieval variants, when augmented with a multi-stage neuro-symbolic pipeline: LLM-based content stripping with iterative error correction converts natural language to abstract categorical forms, a classical symbolic parser validates against the twenty-four Aristotelian syllogistic forms, and asymmetric confidence thresholds mediate between symbolic and neural decisions. Across the four subtasks (ST1 to ST4), our system achieves accuracy ranging from 91.1% to 100% and bias-penalized ranking scores ($\mathcal{M}$) from 31.8 to 100.0, with the main bottleneck being overconfident neural predictions that bypass symbolic verification.
Truth Gradient at SemEval-2026 Task 10: Conspiracy Belief Detection via Narrative Density and Mean Pooling
Ekansh Goyal
Conspiracy believers use significantly more psycholinguistic markers per post than nonbelievers (Cohen’s d = 0.56, p < 10⁻⁸⁰), a pattern we term narrative density, suggesting that belief manifests as structurally denser conspiratorial frames distributed across the full text rather than concentrated in specific lexical cues. We present Truth Gradient’s system for SemEval-2026 Task 10 Subtask 2 (Samory et al., 2026): a DeBERTaV3-large model with mean pooling and a 5-seed probability-averaging ensemble achieving macro F1 = 0.829 on the 77-sample development set and 0.75 on the official test set. The 5-fold CV estimate (0.734 ± 0.007) proves the more reliable predictor of test performance, and we recommend it as standard practice for low-resource shared tasks. Two convergent tests support the narrative density account: masking annotated marker spans drops F1 by 5.3 pp, and direct marker-count fusion recovers +0.9 pp, though we note these are not conclusive given the small dev set. Cross-validated ablation identifies encoder fine-tuning as the dominant design factor (−7.2 pts), and layer-wise probing confirms belief information peaks at mid-stack layers (layer 16/24).
GigitAI at SemEval-2026 Task 11: Hybrid Symbolic-Neural Approach for Syllogistic Validity Classification
Saran Krishnasamy
We present our system for SemEval-2026 Task 11 on classifying whether syllogisms are logically valid. The main challenge is that language models tend to judge arguments based on whether the conclusion sounds true in the real world, rather than whether it follows logically from the premises. We evaluate direct prompting across six models (GPT-4o, GPT-5.2, o3, o3-mini, Claude Opus 4.6, Claude Sonnet 4) with three prompt strategies, finding that even the best achieves only 89.5% accuracy. Our best-performing system splits the task into two parts: GPT-4o-mini extracts the logical structure, then deterministic rules check validity, enhanced with bidirectional premise checking, predicate negation post-processing, and a targeted rule-based fallback for double negation. This achieves 98.95% accuracy on Subtask 1 (combined score 57.74) and 85.8% validity accuracy on Subtask 2. We also explore self-consistency with symbolic verification (93.1%), content abstraction, activation steering, contrastive fine-tuning, RLVR, and diffusion-based reasoning, finding that content abstraction surprisingly degrades performance, revealing that semantic content provides essential parsing scaffolding alongside the bias it introduces.
Team Evaluators at SemEval-2026 Task 6: Instruction-Tuned LLMs for Clarity and Evasion Classification in Political Interviews
Siva Nuthakki | Sanjay Pulagam | Sai Woona
This work is part of the SemEval-2026 CLARITY shared task (Task 6), which focuses on detecting clarity and evasion in political question–answer pairs from interviews and debates. The competition includes two subtasks: clarity-level classification (Clear Reply, Ambiguous, Clear Non-Reply) and evasion-level classification, which identifies one of nine fine-grained evasion techniques. The dataset consists of annotated question–answer pairs with hierarchical labels for both clarity and evasion, enabling comprehensive evaluation of nuanced discourse phenomena. We fine-tune open-source large language models using Low-Rank Adaptation (LoRA) and supervised fine-tuning (SFT), employing structured prompts that jointly encode the question and answer to capture discourse cues. Models are evaluated using Macro F1, the official metric of the shared task. Our system achieves a Macro F1 of 0.83 on Subtask 1 (5th place) and 0.54 on Subtask 2 (9th place), demonstrating that parameter-efficient fine-tuning of LLMs is effective for modeling strategic ambiguity in political discourse.
FunnyBorg at SemEval-2026 Task 1: Humor Generation
Stefan Oprea | Lacrimioara Toma Oprea | Maria Paval-Istrate | Diana Trandabat | Daniela Gifu
Our team competed in SemEval-2026 Task 1: MWAHAHA: Humor Generation. This is a task on the generation of computational humor. The generated jokes are text-based, but also include memes, for captioning an image. Our approach involved prompt engineering using a voting system. We obtained rank 1 in one of the subtasks, and rank 2 in three other subtasks.
Habib University at SemEval-2026 Task 3: A Pipeline Approach for Dimensional Aspect-Based Sentiment Analysis
Muhammad Affan | M Hassan Shahzad | Mikaal Imam | Moiz Zulfiqar | Sandesh Kumar | Abdul Samad
Aspect-based sentiment analysis has evolved from categorical polarity classification to fine-grained modeling of continuous affective dimensions. Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends this paradigm by requiring both structured sentiment extraction and continuous valence–arousal (VA) regression in multilingual settings. In this paper, we present our system for SemEval-2026 Task 3, which evaluates this challenge across six languages and four domains, requiring systems to extract aspect–category–opinion quadruplets and predict VA scores on a 1–9 scale. We propose a modular four-stage multilingual transformer pipeline for element extraction, aspect–opinion pairing, category prediction, and VA regression. We conduct experiments over multiple models and training configurations, including VA rescaling to [-1,1], Gaussian label noise injection, Concordance Correlation Coefficient (CCC) loss, and Savitzky–Golay smoothing. Among all languages, our system achieves the lowest RMSE of 0.5333 on Subtask 1 and the highest cF1 of 0.5492 on Subtask 2. We further investigate data augmentation to improve low-resource performance and address label imbalance. Ultimately, our modular architecture demonstrated highly competitive cross-lingual transfer, achieving top-tier placements in low-resource settings, including 2nd place for Tatar and 6th place for Russian in dimensional regression.
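Two of the training configurations named in this abstract, VA rescaling to [-1,1] and the CCC loss, have compact standard forms. The sketch below assumes a linear rescaling of the 1–9 scale and a loss defined as 1 − CCC, which is a common choice but is not confirmed by the abstract; function names are hypothetical.

```python
def rescale_va(score, low=1.0, high=9.0):
    """Linearly map a score on [low, high] to [-1, 1] (assumed rescaling)."""
    midpoint = (low + high) / 2
    half_range = (high - low) / 2
    return (score - midpoint) / half_range


def ccc(pred, gold):
    """Concordance Correlation Coefficient between two equal-length sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_p - mean_g)^2)
    Undefined (division by zero) when both sequences are constant and equal.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    return 2 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)


def ccc_loss(pred, gold):
    """Loss that is 0 for perfect agreement (assumed form: 1 - CCC)."""
    return 1.0 - ccc(pred, gold)


# Illustrative gold valence scores on the 1-9 scale, rescaled to [-1, 1].
gold = [rescale_va(v) for v in (1.0, 5.0, 9.0)]
perfect = ccc_loss(gold, gold)
shifted = ccc_loss([g + 0.5 for g in gold], gold)
```

Unlike Pearson correlation, CCC penalizes a constant offset between predictions and gold scores, which is why the shifted predictions above incur a nonzero loss.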
SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning
Hans Ole Hatzel | Ekaterina Artemova | Haimo Stiemer | Evelyn Gius | Chris Biemann
We present the shared task on narrative similarity and narrative representation learning, NSNRL (pronounced "nass-na-rel"). The task operationalizes narrative similarity as a binary classification problem: determining which of two stories is more similar to an anchor story. We introduce a novel definition of narrative similarity, compatible with both narrative theory and intuitive judgment. Based on the similarity judgments collected under this concept, we also evaluate narrative embedding representations. We collected at least two annotations each for more than 1,000 story summary triples, with each annotation being backed by at least two annotators in agreement. This paper describes the sampling and annotation process for the dataset; further, we give an overview of the submitted systems and the techniques they employ. We received a total of 71 final submissions from 46 teams across our two tracks. In our triple-based classification setup, LLM ensembles make up many of the top-scoring systems, while in the embedding setup, systems with pre- and post-processing on pretrained embedding models perform about on par with custom fine-tuned solutions. Our analysis identifies potential headroom for improvement of automated systems in both tracks. The task website includes visualizations of embeddings alongside instance-level classification results for all teams.
CophiWue at SemEval-2026 Task 4: Symbolic Narrative Profiling with Taxonomy-Guided Extraction and Contrastive Fine-Tuning
Leonard Konle | Fotis Jannidis
We present our system for SemEval-2026 Task 4, focusing primarily on Track B (narrative embedding). Our approach, the Decompose & Align Cycle, converts each story into a structured NarrativeProfile consisting of abstract themes, a five-step course of action, and an outcome. We then build a NarrativeTaxonomy from these initial extractions via agglomerative clustering, and use the resulting controlled vocabularies to guide a second extraction pass, producing terminologically standardized profiles across the full dataset. Finally, we contrastively fine-tune the Qwen3-Embedding-8B model on profile text representations using TripletLoss, deriving story embeddings from this fine-tuned model. For Track A, we adapt the task’s provided baseline script by substituting Gemini 3 Pro as the judge, using the organizers' default prompt on raw story texts.
Farhan Nafis Rayhan at SemEval-2026 Task 13: Supervised Contrastive Learning Approach with Gated Multiclass Decomposition Ensemble Architecture for Code Authorship Identification
Farhan Rayhan | Fariska Ruskanda
This paper presents our submission for SemEval-2026 Task 13 Subtask B, which requires the multi-class attribution of code snippets across 10 distinct AI generator families and a human baseline. Our proposed system utilizes a three-stage ensemble architecture specifically designed to navigate extreme class imbalance and capture subtle stylometric fingerprints. Initially, we employ Supervised Contrastive Learning to fine-tune a UniXcoder and ModernBERT backbone. The resulting embeddings are then processed by five heterogeneous shallow experts, each utilizing a multiclass decomposition to master specific generator lineages through specialized architectures. A Human Shield acts as a hierarchical safety auditor, serving as an aggressive binary layer that separates human from machine. Finally, a Context-Aware Gated Meta-Learner dynamically aggregates these expert opinions into final predictions. Our experiments reveal that streamlining the system to a pure UniXcoder backbone fine-tuned with supervised contrastive learning improves performance, outclassing the official CodeBERT baseline with a final Macro-F1 score of 0.31389, ranking 26th overall.
CUET320 at SemEval-2026 Task 10: Few-Shot Large Language Models for Psycholinguistic Marker Extraction and Conspiracy Detection
Faozia Fariha | Lamia Khan | Madiha Ahmed Chowdhury | Kawsar Ahmed | Mohammed Moshiul Hoque
Conspiracy theories spread widely on social media and can harm society by increasing mistrust, vaccine hesitancy, and political radicalization. However, most automated detection systems have traditionally relied on topic-specific classifiers, which often struggle to generalize across domains and provide little explanation for why a text is considered conspiratorial. To address these limitations, this paper explores various LLMs on SemEval-2026 Task 10: psycholinguistic conspiracy marker extraction and binary conspiracy detection from Reddit submission statements. Specifically, we adopt a training-free few-shot prompting approach using different instruction-tuned large language models under a variety of few-shot settings (k in {0, 1, 5, 10, 15, 20}). Within this framework, the proposed prompting strategy incorporates psychology-informed instructions to guide the models in identifying conspiracy-related signals. As a result, the presented system achieves an F1 score of 0.21 for marker extraction and 0.81 for conspiracy detection, ranking 16th out of 30 teams in Subtask 1 and 36th out of 52 in Subtask 2 without any task-specific fine-tuning. These results suggest that psycholinguistically grounded prompting can support interpretable conspiracy analysis; however, challenges remain in identifying implicit markers.
UTD-HLTRI at SemEval 2026 Task 4: Reasoning like an Expert for Inferring Narrative Similarity
Rakshitha Rao Ailneni | Maitry Bhavsar | Sanda Harabagiu
Narrative similarity is a challenging problem that requires reasoning over three aspects of narratives: (1) the abstract theme; (2) the course of action; and (3) the outcomes of narratives. We present UTD.HLTRISIM.NARRATIVES, our method developed for SemEval 2026 Task 4 (Narrative Story Similarity), which combines contrastive reasoning prompting with careful selection of few-shot examples to guide a Large Language Model (LLM) toward decisions of narrative comparative similarity. A curriculum learning framework orders examples of narrative triplets presented to the LLM by using a score that quantifies the impact of common narrative aspects with information discerned from several distractors of narrative similarity between pairs of narratives.
Team Vivek Dhayaal at SemEval-2026 Task 13 Subtask B: Multi-Class Authorship Detection
David Rodriguez | Mario Graff
This paper describes the system for SemEval-2026 Task 10 Subtask 2 on conspiracy detection. We explore a progressive modeling strategy comparing traditional lexical representations with contextual transformer models. Lexical baselines include Bag-of-Words and TF-IDF features combined with Logistic Regression and Ridge classifiers. We then fine-tune a DistilRoBERTa transformer model for binary classification. All experiments were conducted using only the official task data in a CPU-only environment without external datasets or data augmentation. Our objective was to achieve acceptable performance while minimizing computational resources and model complexity. Results show that the transformer model improves the best lexical baseline from 0.67 to 0.75. The work highlights that competitive performance in conspiracy detection can be obtained with lightweight and reproducible configurations.
CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection
Christos Tzouvaras | Konstantinos Skianis | Athanasios Voulodimos
This paper describes our system for SemEval-2026 Task 6, which classifies clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous dual large language model (LLM) ensemble via self-consistency (SC) and weighted voting, and a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). This mechanism uses cross-model behavioral signals and exploits the finding that an LLM response-length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases agent count without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place and tied with the second-best reported score.
ABARUAH at SemEval-2026 Task 1: Leveraging High-Resolution VLMs and Reasoning LLMs for Multimodal Humor Generation
Arup Baruah
This paper describes the systems developed for "SemEval 2026 Task 1: Humor Generation". This shared task covered both unimodal text constraints and multimodal GIF-based humor generation. The proposed approach used a two-stage pipeline consisting of a Multimodal Grounding stage to extract semantic descriptions from GIFs and a Humor Synthesis stage to generate the final humorous output. The Qwen2-VL and Qwen3-8B models were used for these respective stages. The system achieved competitive Elo-like ratings of 1009, 973, and 914 for Subtasks A, B1, and B2, respectively, demonstrating its ability to address diverse humorous constraints. The system was ranked 4th in overall standings for Subtasks A and B1.
AI@UMS at SemEval-2026 Task 6: Handling Long Question-Answer Pairs with Sliding Window Models for Clarity and Evasion Analysis
Ikhlasul Amal | Zia Ul Zafar | Choiru Firdaus | Endang Pamungkas
This paper presents the AI@UMS system for SemEval-2026 Task 6: CLARITY - Unmasking Political Question Evasions. The task requires classifying question-answer (QA) pairs from political interviews along two dimensions: clarity level (Subtask 1) and evasion technique (Subtask 2). A key challenge is that political interview transcripts often exceed the 512-token input limit of standard transformer encoder models. We address this with a sliding-window fine-tuning strategy applied to roberta-base, where each QA pair is segmented into overlapping windows of 512 tokens with a stride of 256 tokens. Per-window predictions are aggregated via softmax probability averaging across multiple windows and across an ensemble of three independently trained models with different random seeds. We further employ class-weighted focal-inspired loss and label smoothing to mitigate the pronounced class imbalance in both subtasks. Our system achieves macro F1 scores of 0.62 (Subtask 1) and 0.48 (Subtask 2) on the official evaluation set.
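The windowing and aggregation steps described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, the handling of the final partial window is an assumption, and pooling all windows from all models into one average is one of several possible ways to combine the ensemble.

```python
import math


def sliding_windows(token_ids, window=512, stride=256):
    """Split a token sequence into overlapping windows.

    Consecutive windows start `stride` tokens apart; the last window
    may be shorter than `window` (an assumption, padding is omitted).
    """
    if len(token_ids) <= window:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break
        start += stride
    return windows


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_probabilities(logits_per_window):
    """Average softmax probabilities over windows (and, if the lists
    from several models are concatenated, over the ensemble too)."""
    probs = [softmax(logits) for logits in logits_per_window]
    n_classes = len(probs[0])
    return [sum(p[i] for p in probs) / len(probs) for i in range(n_classes)]


def predict(logits_per_window):
    """Return the index of the class with the highest mean probability."""
    mean_probs = average_probabilities(logits_per_window)
    return max(range(len(mean_probs)), key=mean_probs.__getitem__)
```

With these defaults, a 1000-token input yields three windows starting at tokens 0, 256, and 512, and the predicted label is the argmax of the averaged distribution.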
GUIR at SemEval-2026 Task 7: Probing Cultural Knowledge in LLMs via Multi-Agent Debate
Reihaneh Iranmanesh | Ophir Frieder | Nazli Goharian
We present the GUIR system for SemEval-2026 Task 7, Everyday Knowledge Across Diverse Languages and Cultures, which probes the extent to which general-purpose LLMs encode cultural knowledge without any culture-specific supervision or fine-tuning. Our system addresses two tracks built on the BLEnD benchmark. For the short-answer question (SAQ) track, we employ zero-shot prompting with gpt-4.1, achieving 55.5% accuracy across 61 language locales. For the multiple-choice question (MCQ) track, we propose a three-stage pipeline: (1) zero-shot chain-of-thought inference with gpt-5-mini, (2) cross-locale majority voting to correct inconsistent predictions, and (3) a multi-agent debate protocol in which three LLM instances argue and adjudicate over residual errors. This pipeline achieves 97.47% overall accuracy across 30 locales, ranking first among all submitted systems on the MCQ track. We further conduct a targeted human evaluation on the Persian locale, revealing that BLEnD’s lemma-matching scorer systematically underestimates model performance, with human annotators scoring the system 18 percentage points higher than the lemma-matching evaluation. This reveals the need for better evaluation of morphologically rich languages like Persian.
NAMAA at SemEval-2026 Task 9: Comparing Generative, Retrieval-Augmented, and Discriminative Methods for Arabic Online Polarization Detection and Type Classification
Abdelbasset Djamai | Sahara Al-Madi | Norah Al-Zaid | Khloud Al Jallad | Mona Azim
Detecting polarization in online discourse is important for understanding social fragmentation, yet it remains difficult for Arabic due to dialect variation, informal writing, and implicit framing. In this paper, we study Arabic polarization modeling in the SemEval-2026 Task 9 (POLAR) setting, focusing on polarization detection (ST1) and polarization type classification (ST2). We compare three approaches: encoder fine-tuning, zero-shot prompting, and retrieval-augmented in-context learning (RAG-ICL), across six Arabic encoders and different LLMs. For ST1, RAG-ICL with Gemma-3-27b-it achieves the best result (test macro F1 = 0.83), while remaining competitive with the best fine-tuned encoder (0.82), and substantially outperforming zero-shot prompting. For ST2, a pipeline that first applies the best ST1 encoder as a hard filter and then performs RAG-ICL achieves a macro F1 = 0.62. Prompt-language effects are model- and task-dependent, with some settings doing better with English prompts and others with Arabic prompts. Chain-of-thought, self-refinement, and contrastive prompting do not outperform standard RAG-ICL.
MoMo at SemEval-2026 Task 9: Inference-Only Prompting vs. Fine-Tuning for Multilingual Polarization Detection
Sushant Ray | Rakshita Saksainaa
We describe our submission to SemEval-2026 Task 9 Subtask 1, which focuses on multilingual polarization detection over the POLAR dataset. We compare three adaptation paradigms: fully fine-tuned multilingual encoders, frozen encoders augmented with lightweight residual heads, and inference-only multilingual LLM prompting in zero-shot and few-shot settings. For few-shot prompting, we evaluate both random and similarity-based support example selection. Similarity-based few-shot prompting with a multilingual LLM competes with our fine-tuned encoder baselines while requiring no task-specific training. We further analyze energy usage, stability across prompt selections, and per-language behavior to characterize trade-offs between architectural adaptation and prompt-based inference. While our submission uses a fully fine-tuned XLM-RoBERTa Large, the results indicate that inference-only prompting can be a competitive and energy-efficient alternative to task-specific fine-tuning in multilingual classification.
Codexa at SemEval-2026 Task 13: Loss Engineering and Diverse Ensemble Strategies for Multi-Class Code Authorship Attribution
Anıl Dervişoğlu | Atakan Site
We describe our system for SemEval-2026 Task 13, Subtask B: code classification into 11 categories (human-written or generated by one of 10 LLM families). The task presents extreme class imbalance and distribution shift across multiple generators provided in the dataset (31 in training, 59 in test, with 36 unseen). To address these challenges, our approach has two components: (1) UniXcoder as the encoder with Label-Distribution-Aware Margin (LDAM) loss for handling class imbalance, which provides a +7% absolute improvement over the cross-entropy baseline; and (2) a diverse ensemble of 12 models trained with different objectives and architectures, detailed in the appendix and combined with hard voting. Our system achieves 41.28% Macro F1 on the official test set. We find that loss engineering and ensemble diversity matter more than domain adaptation techniques, which consistently degraded test performance.
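The two components named in this abstract, LDAM loss and hard voting, can be illustrated with a small pure-Python sketch. It follows the usual LDAM formulation, in which the margin for class j is proportional to n_j^(-1/4) and is subtracted from the true-class logit before a scaled cross-entropy. The function names and the values `max_margin=0.5` and `s=30.0` are assumptions chosen for illustration; the abstract does not state the authors' hyperparameters.

```python
import math
from collections import Counter


def ldam_margins(class_counts, max_margin=0.5):
    """Per-class margins proportional to n_j ** -0.25, rescaled so the
    rarest class receives `max_margin` (an assumed hyperparameter)."""
    raw = [1.0 / (n ** 0.25) for n in class_counts]
    scale = max_margin / max(raw)
    return [r * scale for r in raw]


def ldam_loss(logits, target, margins, s=30.0):
    """Cross-entropy on logits whose true-class entry is reduced by
    that class's margin, then multiplied by the scale `s` (assumed)."""
    adjusted = list(logits)
    adjusted[target] -= margins[target]
    scaled = [s * z for z in adjusted]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return log_sum_exp - scaled[target]


def hard_vote(predictions):
    """Majority vote over the labels predicted by ensemble members;
    ties go to the label that appears first in `predictions`."""
    return Counter(predictions).most_common(1)[0][0]
```

Rare classes receive larger margins, so a rare-class example must be classified with more headroom before its loss becomes small; hard voting then simply takes the most frequent label across the ensemble.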
StanceLab at SemEval-2026 Task 9: Addressing Class Imbalance in Multilingual Polarization Detection
Teodor Ivanusca | Dan Dodun-Des-Perrieres | Stefana Gheorghita
Polarization in online discourse poses significant challenges for natural language processing, particularly in multilingual and culturally diverse environments. In this paper, we address the SemEval-2026 POLAR shared task on multilingual polarization detection across 22 languages. We adopt a staged experimental strategy that first investigates the problem in a controlled monolingual English setting before extending the approach to multilingual modeling. Our system evaluates several transformer-based architectures, including RoBERTa, XLM-RoBERTa, MPNet, and mDeBERTa-v3, combined with techniques designed to mitigate class imbalance such as weighted loss functions, focal loss, and data augmentation using back-translation and large language models. Experimental results show that no single configuration consistently dominates across all languages. However, focal loss and augmentation frequently improve performance in languages with skewed label distributions. Our findings highlight the importance of contextual representations, imbalance-aware training strategies, and language-specific considerations for robust multilingual polarization detection.
CoPol at SemEval-2026 Task 9: Modeling Polarization Type Co-occurrence with Label Correlation Networks
Pushkar Arora
POLAR-LDA is a label-dependency–aware system for SemEval-2026 Task 9 (multi-label polarization type classification) that augments an mDeBERTa-v3-base encoder with a Label Correlation Network (language-specific directed co-occurrence matrices + GAT), Asymmetric Loss tuned for extreme positive scarcity, and a language-grouped ensemble. The system scores 0.567 macro F1 across 22 languages (ranging from 0.784 for Hindi to 0.256 for Italian) and shows clear ablation gains (ASL +0.041, LCN +0.030, ensemble +0.018). Key findings: absolute data voids (0–1 positive examples) form an unrecoverable floor for supervised learning; label co-occurrence is culturally situated (e.g., political↔religious in Indic vs. political↔racial in some Western languages) and benefits from language-specific graphs; and per-label training volume predicts cross-lingual performance better than linguistic family. Limitations are honest and important: noisy AL estimates under scarcity, an incoherent residual "other" category, and domain mismatch between pretraining corpora and polarization discourse. Overall, the paper offers a strong shared-task system and useful empirical diagnostics: practical and well-executed, but incrementally novel methodologically.
SemEval-2026 Task 10: PsyCoMark – Psycholinguistic Conspiracy Marker Extraction and Detection
Mattia Samory | Felix Soldner | Veronika Batzdorfer
Despite the need to address the proliferation of conspiracy theories in online discussions, there is a lack of benchmarks for effectively detecting conspiracy-related content in everyday conversational settings. We introduce a novel dataset of comments from Reddit, ranging from politics to TV series, as well as two synergetic tasks: (1) extracting five psycholinguistic markers, grounded in evolutionary psychology, and (2) detecting conspiracy content. The data enable multi-task approaches, allowing testing of whether marker extraction improves detection performance.
SemEval-2026 Task 13: Detecting Machine-Generated Code with Multiple Programming Languages, Generators, and Application Scenarios
Daniil Orel | Dilshod Azizov | Indraneil Paul | Yuxia Wang | Iryna Gurevych | Preslav Nakov
We present the results and the main findings of SemEval-2026 Task 13: Detecting Machine-Generated Code with Multiple Programming Languages, Generators, and Application Scenarios. Our task featured three subtasks. Subtask A is a binary classification task that determines whether a given code snippet is written by a human or generated by a machine. This subtask focuses on the development of robust methods for AI-generated code identification, since the training and the test data splits have code in different languages and cover diverse usage domains. Subtask B focuses on defining synthetic code smells and requires participants to identify the provenance of the generator family of the model that generated the given code snippet. Subtask C aims at more fine-grained attribution of the written code: whether it was fully AI-generated, fully human-written, produced in human-AI collaboration (hybrid) or by a model tuned or prompted to give human-like code. The task attracted a large number of team members: subtask A (81), subtask B (34), and subtask C (32). In this study, we present the task, analyze the results, and discuss the submitted systems and the methods they used.
SemEval-2026 Task 12: Abductive Event Reasoning: Towards Real-World Event Causal Inference for Large Language Models
Pengfei Cao | Mingxuan Yang | Yubo Chen | Chenlong Zhang | Mingxuan Liu | Kang Liu | Jun Zhao
Understanding why real-world events occur is important for both natural language processing and practical decision-making, yet direct-cause inference remains underexplored in evidence-rich settings. To address this gap, we organized SemEval-2026 Task 12: Abductive Event Reasoning (AER). The task asks systems to identify the most plausible direct cause of a target event from supporting evidence. We formulate AER as an evidence-grounded multiple choice benchmark that captures key challenges of real-world causal reasoning, including distributed evidence, indirect background factors, and semantically related but non-causal distractors. The shared task attracted 122 participants and received 518 submissions. This paper presents the task formulation, dataset construction pipeline, evaluation setup, and system results. AER provides a focused benchmark for abductive reasoning over real-world events and highlights challenges for future work on causal reasoning and multi-document understanding.
SemEval-2026 Task 8: MTRAGEval: Evaluating Multi-Turn RAG Conversations
Sara Rosenthal | Vraj Shah | Yannis Katsis | Marina Danilevsky
We present the results and findings from SemEval Task 8: MTRAGEval. MTRAGEval measures three Retrieval Augmented Generation (RAG) subtasks: A. Retrieval, B. Generate, and C. Retrieve+Generate (full RAG) on multi-turn conversations. The task is evaluated using MTRAG-UN, a new benchmark for Multi-Turn RAG focusing on Unanswerable, Underspecified, Non-Standalone, and Unclear Questions. The MTRAGEval task attracted strong participation with 107 registered teams and 92 submissions across all subtasks, and yielded several interesting findings on effective retrieval and query rewriting techniques, the use of ensemble models, and the compounding costs of retrieval errors on downstream generation quality.
SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Stories through Narrative Understanding
Janosch Gehring | Selina Meyer | Michael Roth
We introduce SemEval-2026 Task 5 on "Rating Plausibility of Word Senses in Ambiguous Stories through Narrative Understanding". The dataset for this task consists of 4-5 sentence English short stories. In each story, one sentence includes a lexical ambiguity and different senses are to be judged in terms of plausibility on a Likert scale. The task is intentionally constructed to be challenging by stories only providing sparse contextual cues. We give an overview of well-performing, frequent and interesting approaches used by participating systems. From a total of 175 registered participants and 27 submitted system description papers, the best system achieved an "accuracy within standard deviation" score of 93.3%.
SemEval-2026 Task 6: CLARITY – Unmasking Political Question Evasions
Konstantinos Thomas | Giorgos Filandrianos | Maria Lymperaiou | Chrysoula Zerva | Giorgos Stamou
This paper presents CLARITY, the SemEval-2026 shared task on detecting and classifying evasive responses in political discourse. The task is grounded in an expert-designed two-level taxonomy and a benchmark dataset of question-answer pairs from U.S. presidential interviews, requiring systems to distinguish clear from evasive responses at a coarse level and identify one of nine fine-grained evasion strategies at a fine-grained level. With 124 registered teams and over 1,400 combined valid submissions, the task attracted broad participation spanning a wide range of methodological approaches, from fine-tuned encoder models to multi-stage large language model pipelines. Analysis of submitted systems reveals that hierarchical exploitation of the taxonomy and chain-of-thought prompted LLMs were the most effective strategies, while fine-grained evasion classification remained a substantially harder and largely unsolved challenge. CLARITY advances the study of strategic ambiguity in political language as a formal NLP benchmark and highlights key open problems in computational discourse analysis.
SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models
Marco Valentino | Leonardo Ranaldi | Giulia Pucci | Federico Ranaldi | André Freitas
SemEval-2026 Task 11 evaluates the ability of Large Language Models (LLMs) to perform content-independent reasoning through a novel multilingual syllogistic dataset designed to measure the "content effect" — the tendency to conflate semantic plausibility with logical validity. The competition featured four subtasks, covering English and multilingual settings with both standard and noisy premise sets. Evaluations of zero-shot baselines reveal that the content effect is pervasive in open models, whereas newer versions demonstrate a significant shift in performance. Across the subtasks, findings indicate that introducing distracting premises can challenge the models’ ability to filter misleading information, while multilingual settings amplify their susceptibility to content biases compared to English. Participants proposed diverse approaches, including neuro-symbolic decomposition, fine-tuning and distillation methods, data augmentation, and activation steering. While explicit symbolic verification remains the most reliable strategy, activation-level interventions and fine-tuning methods offer promising pathways for internalising formal logic within neural architectures. Ultimately, the task reinforces the efficacy of neuro-symbolic approaches and emerging architectural trends for logical reliability, while also highlighting that multilingual setups and longer contexts still pose significant challenges to be investigated in future research.
SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time from Ecological Essays
Nikita Soni | H. Andrew Schwartz | Ryan Boyd | Phi Long Bui | Syeda Mahwish | August Nilsson | Adithya V Ganesan | Lyle Ungar | Niranjan Balasubramanian | Saif Mohammad
We present our shared task on predicting variation in emotional valence and arousal over time from ecological essays. The shared task uses a longitudinal dataset collected over 7 data collection phases of 14 days each, spanning 2021 to 2024, consisting of real-time essays and feeling words (e.g., happy, calm, sad, etc.) written in English by U.S. service-industry workers about “how they are feeling”. Each text is associated with self-reported valence (V) (0 - 4, highly negative to highly positive affect) and arousal (A) (0 - 2, low to high energy) scores. The shared task consists of three parts: Subtask 1, Longitudinal Affect Assessment; and Subtask 2, Forecasting Variation in Affect, as (2a) state change and (2b) disposition change. The task attracted over 200 member registrations on Codabench, receiving official system submissions from 31 teams (total 104 team members), of which 28 teams (with 90 team members) submitted system description papers making it to our leaderboard. We discuss baseline results along with findings from 28 systems, highlighting the best-performing systems, a deeper analysis of performance on essays versus feeling words, and assessments for authors seen versus unseen during training. The datasets for this task are publicly available.
SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA)
Liang-Chih Yu | Jonas Becker | Shamsuddeen Hassan Muhammad | Idris Abdulmumin | Lung-Hao Lee | Ying-Lung Lin | Jin Wang | Jan Philip Wahle | Terry Lima Ruas | Natalia Loukachevitch | Alexander Panchenko | Ilseyar Alimova | Lilian Diana Awuor Wanzare | Nelson Odhiambo | Bela Gipp | Kai-Wei Chang | Saif Mohammad
We present the SemEval-2026 shared task on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which improves traditional ABSA by modeling sentiment along valence–arousal (VA) dimensions rather than using categorical polarity labels. To extend ABSA beyond consumer reviews to public-issue discourse (e.g., political, energy, and climate issues), we introduce an additional task, Dimensional Stance Analysis (DimStance), which treats stance targets as aspects and reformulates stance detection as regression in the VA space. The task consists of two tracks: Track A (DimABSA) and Track B (DimStance). Track A includes three subtasks: (1) dimensional aspect sentiment regression, (2) dimensional aspect sentiment triplet extraction, and (3) dimensional aspect sentiment quadruplet extraction, while Track B includes only the regression subtask for stance targets. We also introduce a continuous F1 (cF1) metric to jointly evaluate structured extraction and VA regression. The task attracted more than 400 participants, resulting in 112 final submissions and 42 system description papers. We report baseline results, discuss top-performing systems, and analyze key design choices to provide insights into dimensional sentiment analysis at the aspect and stance-target levels. All resources are available on our GitHub repository.
SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Usman Naseem | Robert Geislinger | Ada Ren | Sarah Kohail | Rudy Garrido Veliz | P Sam Sahil | Yiran Zhang | Marco Antonio Stranisci | Idris Abdulmumin | Özge Alacam | Cengiz Acarturk | Aisha Jabr | Saba Anwar | Abinew Ali Ayele | Elena Tutubalina | Aung Kyaw Htet | Xintong Wang | Surendrabikram Thapa | Tanmoy Chakraborty | Dheeraj Kodati | Sahar Moradizeyveh | Firoj Alam | Ye Kyaw Thu | Shantipriya Parida | Ihsan Ayyub Qazi | Lilian Diana Awuor Wanzare | Nelson Odhiambo | Clemencia Siro | Ibrahim Said Ahmad | Adem Chanie Ali | Martin Semmann | Chris Biemann | Shamsuddeen Hassan Muhammad | Seid Muhie Yimam
We present SemEval-2026 Task 9, a shared task on online polarization detection, covering 22 languages and comprising over 110K annotated instances. Each data instance is multi-labeled with the presence of polarization, polarization type, and polarization manifestation. Participants were asked to predict labels in three subtasks: (1) detecting the presence of polarization, (2) identifying the type of polarization, and (3) recognizing the polarization manifestation. The three tasks attracted over 1,000 participants worldwide and more than 10k submissions on Codabench. We received final submissions from 67 teams and 69 system description papers. We report the baseline results and analyze the performance of the best-performing systems, highlighting the most common approaches and the most effective methods across different subtasks and languages. The dataset and other resources for this task are publicly available.
SemEval-2026 Task 1: MWAHAHA, Models Write Automatic Humor And Humans Annotate
Santiago Castro | Luis Chiruzzo | Santiago Góngora | Naihao Deng | Salar Rahili | Ignacio Sastre | Aiala Rosá | Victoria Amoroso | Guillermo Rey | Guillermo Moncecchi | J. A. Meaney | Juan José Prada | Rada Mihalcea
We present SemEval-2026 Task 1: MWAHAHA (Models Write Automatic Humor And Humans Annotate), the first shared task on general-purpose humor generation. Systems must produce short jokes in English, Spanish, and Chinese under lexical or topical constraints (Subtask A) and generate humorous captions for GIFs (Subtask B). To discourage memorization and ensure fairness, all jokes must meet specific criteria, such as using infrequent word pairs or relating to recent news headlines. Evaluation is conducted through pairwise human preference judgments in a Chatbot Arena-style setting, yielding Elo-based rankings. The task attracted 309 registered users, with 37 teams submitting systems to the evaluation phase. Participating systems employ a wide range of NLP techniques, including generate-then-rank pipelines, reinforcement learning, parameter-efficient fine-tuning, retrieval-augmented generation, humor-theory-grounded prompting, and persona-based strategies. Our Gemini 2.5 Flash baseline, using simple prompts, tied for first place in all subtasks, and the majority of elaborate multi-stage pipelines only marginally surpassed it with overlapping confidence intervals. More work is necessary to outperform the simple usage of state-of-the-art large language models. We release all evaluation data, prompts, and leaderboard results to support future research in computational humor generation.
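To make the evaluation idea concrete, the sketch below turns a stream of pairwise preference judgments into ratings with the classic online Elo update. It is an illustration only: the abstract does not specify how the organizers computed their Elo-based rankings, and the function names, the K-factor of 32, and the initial rating of 1000 are assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(comparisons, k=32.0, initial=1000.0):
    """Compute ratings from an ordered list of (winner, loser) pairs.

    `k` and `initial` are assumed values, not taken from the task.
    """
    ratings = {}
    for winner, loser in comparisons:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        # The winner gains what the loser drops, so the total is conserved.
        delta = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


def ranking(ratings):
    """System names sorted from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)
```

Online Elo depends on the order of the comparisons, which is one reason arena-style leaderboards often report confidence intervals alongside the ratings.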
SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures
Nedjma Ousidhoum | Junho Myung | Carla Perez Almendros | Jiho Jin | Amr Keleg | Meriem Beloucif | Yi Zhou | Rodrigo Agerri | Vladimir Araujo | Naomi Baes | James Barry | Joanne Boisson | Nancy Chen | Christine De Kock
We present our shared task on evaluating the adaptability of LLMs and NLP systems across multiple languages and cultures. The task data consist of an extended version of our manually constructed BLEnD benchmark (Myung et al., 2024), covering more than 30 language–culture pairs, predominantly representing low-resource languages spoken across multiple continents. As the task is designed strictly for evaluation, participants were not permitted to use the data for training, fine-tuning, few-shot learning, or any other form of model modification. Our task includes two tracks: (a) Short-Answer Questions (SAQ) and (b) Multiple-Choice Questions (MCQ). Participants were required to predict labels and were allowed to submit any NLP system and adopt diverse modelling strategies, provided that the benchmark was used solely for evaluation. The task attracted more than 140 registered participants, and we received final submissions from 62 teams, along with 19 system description papers. We report the results and present an analysis of the best-performing systems and the most commonly adopted approaches. Furthermore, we discuss shared insights into open questions and challenges related to evaluation, misalignment, and methodological perspectives on model behaviour in low-resource languages and for under-represented cultures. Our data and resources are available at https://github.com/BLEnD-SemEval2026/SemEval-2026-Task-7.