Flavio Giobergia


2026

The growing use of large language models for code generation makes distinguishing machine-generated code from human-written code increasingly difficult, especially under distribution shifts in language, domain, and generator family. SemEval-2026 Task 13 targets this challenge through three subtasks: binary detection, multi-class authorship attribution, and hybrid/adversarial code detection.In this paper, we conduct an empirical study across all subtasks, comparing a variety of approaches: frozen encoder representations, feature-based classifiers, fine-tuned transformer models, post-hoc calibration, and probability-level ensembling. Our results show a consistent generalisation gap: strong in-domain validation scores substantially overestimate performance on shifted test conditions.The code is available at https://github.com/AlexandraElena-Holota/SemEval-2026-Task13.git
Humor generation presents significant challenges due to subjectivity and the limitations of automatic metrics. In this work, we address Task 1 of SemEval 2026 (Subtask A) by evaluating three instruction-tuned models (Llama 3.1, Gemma 2, and Qwen 2.5) via a round-robin LLM judging framework. We investigate the impact of Retrieval-Augmented Generation and Direct Preference Optimization (DPO) on performance. Our results identify Llama 3.1 as the strongest baseline and demonstrate that DPO consistently improves humor quality across configurations. These findings confirm the efficacy of LLM-based judging as a practical training signal for optimizing subjective generation tasks.
While large language models (LLMs) excel at semantic reasoning, their discrete token-based outputs introduce limitations for fine-grained regression tasks requiring continuous scoring. We address graded word-sense plausibility estimation by reformulating it as a Natural Language Inference (NLI) regression problem, adapting DeBERTa-v3-large with NLI pretraining and a regression head to predict continuous plausibility scores from story-sense pairs. We compare this model against BERT, vanilla DeBERTa, SmolLM variants and state-of-the art LLMs under various prompting strategies, and show that the NLI-finetuned model achieves superior rank correlation and alignment with human judgments. While several baselines collapse toward mean predictions and LLMs show unstable prompting sensitivity, our findings establish NLI-informed pretraining as highly effective for narrative plausibility regression, highlighting fundamental LLM limitations for word sense disambiguation.
Online polarization has become a central challenge in digital discourse, characterized by hostility, identity-based division, and culturally dependent expressions that vary across languages. Automatically detecting such phenomena is particularly difficult in multilingual settings, where semantic nuance and implicit rhetoric complicate cross-lingual generalization.In this context, we participate in POLAR, a shared task at SemEval 2026 on multilingual polarization detection and categorization across 22 languages. We compare three modeling paradigms: multilingual encoder fine-tuning, translation-based transfer learning, and prompting-based generative reasoning. For the multi-label categorization task, we introduce a two-stage cascaded architecture to mitigate false positives under severe class imbalance.Our results show that multilingual encoders achieve the most robust performance for binary detection, whereas reasoning-based prompting is competitive for fine-grained category classification. This comparative study highlights the strengths and limitations of each paradigm for cross-lingual polarization analysis.
Recently, Retrieval-Augmented Generation (RAG) has become a significant task in Large Language Models (LLMs). In multi-turn RAG, a good system must overcome the challenges of maintaining context as the dialogue turns progress and manage the issue of generating answers based on conversation history. In this work, we address the MTRAGEval task 8 at SemEval-2026, by presenting a high-performance, parallelised Multi-Turn RAG pipeline designed to address three subtasks: Retrieval (Subtask A), Generation (Subtask B), and End-to-End RAG (Subtask C). Our methodology utilises a Streamlit framework that allows users to embed diverse corpora with varying vector spaces and embedding models, facilitating configuration for each task based on its nature. Some key experiments focus on the performance of different vector databases and embedding models, the necessity of LLM-based query rewriting (QR) for non-standalone questions, the use of different rerankers, and the scale and performance of the selected LLM for answer generation. We conclude that a configuration utilising query rewriting along with reranking delivers the best results. The code is available on GitHub https://github.com/merttoprak1/MTRAGEval-Evaluating-Multi-Turn-RAG-Conversations.
The rapid advancement of Large Language Models (LLMs) has significantly impacted software engineering, posing challenges for determining the origin and authenticity of source code. This paper presents the MALTO team’s submission for SemEval-2026 Task 13, explicitly focusing on Subtask B (Authorship Attribution among 11 classes) and Subtask C (Hybrid Code Detection). To address severe class imbalance and the complex boundaries of mixed human-machine code, we propose a unified framework that leverages an ensemble of UniXcoder and CodeT5. Our approach integrates a robust Tree-sitter-based Universal Canonicalization strategy, Data Augmentation, and a novel 3-Phase Curriculum Training schedule enhanced by Hard Negative Mining. Specifically, UniXcoder’s cross-modal representations excel at distinguishing among semantically overlapping LLM families (Subtask B), whereas CodeT5’s identifier-aware architecture is superior at detecting subtle structural anomalies in hybrid and adversarial snippets (Subtask C). By aggregating these complementary strengths, our soft-voting ensemble overcomes the limitations of individual models, demonstrating strong robustness against imbalanced distributions and effectively discriminating between purely human, purely machine, hybrid, and adversarial code snippets.
Longitudinal modelling of affect from text requires capturing both linguistic content and temporal emotional dynamics. SemEval-2026 Task 2 introduces a dataset of essays and feeling words annotated with self-reported valence and arousal scores. In this work, we propose a neural architecture that combines pretrained Transformer encoders with temporal sequence modelling to predict continuous valence and arousal over user-specific timelines. Individual texts are encoded using a Transformer-based language model and aggregated through attention-based pooling before being processed by recurrent layers to capture longitudinal dependencies. To adapt pretrained representations under limited data conditions, we explore parameter-efficient fine-tuning strategies. We make the code available at https://github.com/AndreaLolli2912/SemEval2026-EmoVA.

2025

The growing capabilities of Large Language Models (LLMs) have opened up new opportunities for answering questions based on structured data. However, LLMs often struggle to directly handle tabular data and provide accurate, grounded answers. This paper addresses the challenge of Question Answering (QA) over tabular data, specifically in the context of SemEval-2025 Task 8. We propose an LLM-based pipeline that generates SQL queries to extract answers from tabular datasets. Our system leverages In-Context Learning to produce queries, which are then executed on structured tables, to produce the final answers. We demonstrate that our solution performs effectively in a few-shot setup and scales well across tables of different sizes. Additionally, we conduct a data-driven error analysis to highlight scenarios where the model encounters difficulties. We make the code available at https://github.com/fgiobergia/SemEval2025-Task8.
Food safety is a critical concern: hazardous incident reports need to be classified to be able to take appropriate measures in a timely manner. The SemEval-2025 Task 9 on Food Hazard Detection aims to classify food-related incident reports by identifying both the type of hazard and the product involved, at both coarse and fine levels of granularity. In this paper, we present our solution that approaches the problem by leveraging two independent encoder-only transformer models, each fine-tuned separately to classify hazards and food products, at the two levels of granularity of interest. Experimental results show that our approach effectively addresses the classification task, achieving high-quality performance on both subtasks. We additionally include a discussion on potential improvements for future iterations, and a brief description of failed attempts. We make the code available at https://github.com/fgiobergia/SemEval2025-Task9.
Large language models (LLMs) may retain and reproduce sensitive information learned during training, posing significant privacy and ethical concerns. Once detected, this personal information should be deleted from the model. A naive answer could be to retrain these models from scratch when needed. However, this solution is unfeasible given the immense computational, economic, and environmental costs required to train these models. For this reason, Machine Unlearning (MU) has risen in recent years as an emerging field of research to efficiently delete specific information from a model’s knowledge. This paper presents our solution to the “Unlearning sensitive content from Large Language Models” shared task at SemEval-2025, which challenges researchers to develop effective LLM MU techniques. We adopt a Dual-Teacher framework that leverages a Competent and an Incompetent Teacher to erase unwanted information while selectively preserving model utility. Our approach adapts established computer vision unlearning methods to the sequential nature of language models through KL divergence minimization over next-token prediction probabilities. Our experimental results demonstrate that our method outperforms the state-of-the-art techniques.
Large language models (LLMs) often produce {textit{hallucinations}} —factually incorrect statements that appear highly persuasive. These errors pose risks in fields like healthcare, law, and journalism. This paper presents our approach to the Mu-SHROOM shared task at SemEval 2025, which challenges researchers to detect hallucination spans in LLM outputs. We introduce a new method that combines probability-based analysis with Natural Language Inference to evaluate hallucinations at the word level. Our technique aims to better align with human judgments while working independently of the underlying model. Our experimental results demonstrate the effectiveness of this method compared to existing baselines.

2024

In Natural Language Generation (NLG), contemporary Large Language Models (LLMs) face several challenges, such as generating fluent yet inaccurate outputs and reliance on fluency-centric metrics. This often leads to neural networks exhibiting “hallucinations.” The SHROOM challenge focuses on automatically identifying these hallucinations in the generated text. To tackle these issues, we introduce two key components, a data augmentation pipeline incorporating LLM-assisted pseudo-labelling and sentence rephrasing, and a voting ensemble from three models pre-trained on Natural Language Inference (NLI) tasks and fine-tuned on diverse datasets.