Riccardo Coppola


2026

The growing use of large language models for code generation makes distinguishing machine-generated code from human-written code increasingly difficult, especially under distribution shifts in language, domain, and generator family. SemEval-2026 Task 13 targets this challenge through three subtasks: binary detection, multi-class authorship attribution, and hybrid/adversarial code detection. In this paper, we conduct an empirical study across all subtasks, comparing a variety of approaches: frozen encoder representations, feature-based classifiers, fine-tuned transformer models, post-hoc calibration, and probability-level ensembling. Our results show a consistent generalisation gap: strong in-domain validation scores substantially overestimate performance on shifted test conditions. The code is available at https://github.com/AlexandraElena-Holota/SemEval-2026-Task13.git
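To make two of the listed techniques concrete, the following is a minimal sketch of post-hoc calibration via temperature scaling followed by probability-level ensembling. The abstract does not state which calibration method or ensemble weights were used, so temperature scaling, uniform weights, the function names, and the logit values are all assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a temperature above 1 softens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_probabilities(prob_lists, weights=None):
    """Weighted average of per-model class probabilities for one sample."""
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0 / n_models] * n_models  # assumption: uniform weights
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists))
        for c in range(n_classes)
    ]

# Hypothetical logits from two detectors for one snippet (classes: human, machine).
model_a_logits = [2.0, 0.0]
model_b_logits = [0.0, 3.0]

# Hypothetical temperatures; in practice they would be fitted on validation data.
probs_a = softmax(model_a_logits, temperature=2.0)
probs_b = softmax(model_b_logits, temperature=1.0)

avg = ensemble_probabilities([probs_a, probs_b])
predicted = max(range(len(avg)), key=avg.__getitem__)  # index of the top class
```

Averaging calibrated probabilities rather than hard labels lets a confident model outvote an uncertain one, which is the usual motivation for ensembling at the probability level.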
Humor generation presents significant challenges due to subjectivity and the limitations of automatic metrics. In this work, we address Task 1 of SemEval 2026 (Subtask A) by evaluating three instruction-tuned models (Llama 3.1, Gemma 2, and Qwen 2.5) via a round-robin LLM judging framework. We investigate the impact of Retrieval-Augmented Generation and Direct Preference Optimization (DPO) on performance. Our results identify Llama 3.1 as the strongest baseline and demonstrate that DPO consistently improves humor quality across configurations. These findings confirm the efficacy of LLM-based judging as a practical training signal for optimizing subjective generation tasks.
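For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair. It assumes the preferred and rejected outputs have already been selected (for example by the LLM judges) and that their summed log-probabilities under the trained policy and a frozen reference model are available. The function name, argument names, and the value of beta are illustrative and not taken from the paper.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved from the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities for one joke pair.
loss_before = dpo_loss(-2.0, -2.0, -2.0, -2.0)  # policy identical to reference
loss_after = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy now favours the chosen joke
```

The loss falls as the policy assigns relatively more probability to the preferred output than the reference model does, so judge preferences act directly as the training signal.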
While large language models (LLMs) excel at semantic reasoning, their discrete token-based outputs introduce limitations for fine-grained regression tasks requiring continuous scoring. We address graded word-sense plausibility estimation by reformulating it as a Natural Language Inference (NLI) regression problem, adapting DeBERTa-v3-large with NLI pretraining and a regression head to predict continuous plausibility scores from story-sense pairs. We compare this model against BERT, vanilla DeBERTa, SmolLM variants, and state-of-the-art LLMs under various prompting strategies, and show that the NLI-finetuned model achieves superior rank correlation and alignment with human judgments. While several baselines collapse toward mean predictions and LLMs show unstable prompting sensitivity, our findings establish NLI-informed pretraining as highly effective for narrative plausibility regression, highlighting fundamental LLM limitations for word sense disambiguation.
Online polarization has become a central challenge in digital discourse, characterized by hostility, identity-based division, and culturally dependent expressions that vary across languages. Automatically detecting such phenomena is particularly difficult in multilingual settings, where semantic nuance and implicit rhetoric complicate cross-lingual generalization. In this context, we participate in POLAR, a shared task at SemEval 2026 on multilingual polarization detection and categorization across 22 languages. We compare three modeling paradigms: multilingual encoder fine-tuning, translation-based transfer learning, and prompting-based generative reasoning. For the multi-label categorization task, we introduce a two-stage cascaded architecture to mitigate false positives under severe class imbalance. Our results show that multilingual encoders achieve the most robust performance for binary detection, whereas reasoning-based prompting is competitive for fine-grained category classification. This comparative study highlights the strengths and limitations of each paradigm for cross-lingual polarization analysis.
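The abstract does not describe the two stages in detail. One common reading of a two-stage cascade is sketched below: a binary polarization detector gates a multi-label category classifier, so category labels are emitted only for texts the first stage flags. The function name, thresholds, and category names are hypothetical.

```python
def cascade_predict(polarized_prob, category_probs,
                    gate_threshold=0.5, category_threshold=0.5):
    """Return 0/1 labels per category, gated by a binary detector.

    polarized_prob: stage-1 probability that the text is polarized.
    category_probs: stage-2 per-category probabilities (dict of name -> float).
    """
    # Stage 1: texts judged non-polarized receive no category labels at all,
    # which removes false positives the multi-label model might produce.
    if polarized_prob < gate_threshold:
        return {name: 0 for name in category_probs}
    # Stage 2: threshold each category independently (multi-label output).
    return {name: int(p >= category_threshold)
            for name, p in category_probs.items()}

# Hypothetical stage-2 scores for one text; category names are illustrative.
scores = {"political": 0.7, "religious": 0.2, "gender": 0.6}

gated_out = cascade_predict(0.1, scores)  # stage 1 rejects the text
passed = cascade_predict(0.9, scores)     # stage 1 accepts the text
```

Under severe class imbalance, most texts carry no category label, so letting the first stage filter them out keeps rare-category false positives from dominating the errors.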
Recently, Retrieval-Augmented Generation (RAG) has become a significant task for Large Language Models (LLMs). In multi-turn RAG, a good system must maintain context as the dialogue progresses and generate answers that take the conversation history into account. In this work, we address MTRAGEval (Task 8 at SemEval-2026) by presenting a high-performance, parallelised Multi-Turn RAG pipeline designed to address three subtasks: Retrieval (Subtask A), Generation (Subtask B), and End-to-End RAG (Subtask C). Our methodology utilises a Streamlit framework that allows users to embed diverse corpora with varying vector spaces and embedding models, facilitating configuration for each task based on its nature. Our key experiments examine the performance of different vector databases and embedding models, the necessity of LLM-based query rewriting (QR) for non-standalone questions, the use of different rerankers, and the scale and performance of the selected LLM for answer generation. We conclude that a configuration utilising query rewriting along with reranking delivers the best results. The code is available on GitHub at https://github.com/merttoprak1/MTRAGEval-Evaluating-Multi-Turn-RAG-Conversations.
Longitudinal modelling of affect from text requires capturing both linguistic content and temporal emotional dynamics. SemEval-2026 Task 2 introduces a dataset of essays and feeling words annotated with self-reported valence and arousal scores. In this work, we propose a neural architecture that combines pretrained Transformer encoders with temporal sequence modelling to predict continuous valence and arousal over user-specific timelines. Individual texts are encoded using a Transformer-based language model and aggregated through attention-based pooling before being processed by recurrent layers to capture longitudinal dependencies. To adapt pretrained representations under limited data conditions, we explore parameter-efficient fine-tuning strategies. We make the code available at https://github.com/AndreaLolli2912/SemEval2026-EmoVA.
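As an illustration of the attention-based pooling step, the sketch below collapses a sequence of token vectors into one text vector using softmax-normalised scores from a learned scoring vector. This is one simple form of attention pooling and not necessarily the variant used in the paper. The vectors are toy values, and in the described architecture the pooled vectors would then be passed to the recurrent layers.

```python
import math

def attention_pool(token_vectors, scoring_vector):
    """Pool token vectors into one vector with softmax attention weights."""
    # Score each token by its dot product with the scoring vector.
    scores = [sum(h * w for h, w in zip(vec, scoring_vector))
              for vec in token_vectors]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the token vectors, one dimension at a time.
    dim = len(token_vectors[0])
    pooled = [sum(a * vec[d] for a, vec in zip(weights, token_vectors))
              for d in range(dim)]
    return pooled, weights

# Toy encoder outputs for a two-token text (hypothetical values).
tokens = [[1.0, 0.0], [3.0, 2.0]]

# A zero scoring vector gives uniform weights, so pooling reduces to the mean.
mean_pooled, uniform_weights = attention_pool(tokens, [0.0, 0.0])

# A non-zero scoring vector shifts weight toward the higher-scoring token.
pooled, weights = attention_pool(tokens, [1.0, 0.0])
```

Unlike mean pooling, the weights are learned, so the model can emphasise the tokens most informative for valence and arousal.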