Claudio Savelli


2026

The growing use of large language models for code generation makes distinguishing machine-generated code from human-written code increasingly difficult, especially under distribution shifts in language, domain, and generator family. SemEval-2026 Task 13 targets this challenge through three subtasks: binary detection, multi-class authorship attribution, and hybrid/adversarial code detection. In this paper, we conduct an empirical study across all subtasks, comparing a variety of approaches: frozen encoder representations, feature-based classifiers, fine-tuned transformer models, post-hoc calibration, and probability-level ensembling. Our results show a consistent generalisation gap: strong in-domain validation scores substantially overestimate performance on shifted test conditions. The code is available at https://github.com/AlexandraElena-Holota/SemEval-2026-Task13.git
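The abstract above names post-hoc calibration without saying which method was used. As an illustration only, the sketch below assumes temperature scaling, a common choice, fitted by a simple grid search on held-out logits. The function names, the grid, and the toy logits are all hypothetical and are not taken from the paper or its repository.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    loss = 0.0
    for logits, label in zip(logit_rows, labels):
        loss -= math.log(softmax(logits, temperature)[label])
    return loss / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid that minimises held-out NLL."""
    if grid is None:
        grid = [0.5 + 0.1 * i for i in range(46)]  # 0.5, 0.6, ..., 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Toy validation set (made up): the model is very confident but wrong once.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 1, 1, 0]

best_t = fit_temperature(val_logits, val_labels)
calibrated = softmax([4.0, 0.0], best_t)  # softened probabilities for a new input
```

A fitted temperature above 1 softens overconfident predictions without changing the predicted class, so accuracy is unaffected and only the probabilities move.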
Humor generation presents significant challenges due to subjectivity and the limitations of automatic metrics. In this work, we address Task 1 of SemEval 2026 (Subtask A) by evaluating three instruction-tuned models (Llama 3.1, Gemma 2, and Qwen 2.5) via a round-robin LLM judging framework. We investigate the impact of Retrieval-Augmented Generation and Direct Preference Optimization (DPO) on performance. Our results identify Llama 3.1 as the strongest baseline and demonstrate that DPO consistently improves humor quality across configurations. These findings confirm the efficacy of LLM-based judging as a practical training signal for optimizing subjective generation tasks.
Online polarization has become a central challenge in digital discourse, characterized by hostility, identity-based division, and culturally dependent expressions that vary across languages. Automatically detecting such phenomena is particularly difficult in multilingual settings, where semantic nuance and implicit rhetoric complicate cross-lingual generalization. In this context, we participate in POLAR, a shared task at SemEval 2026 on multilingual polarization detection and categorization across 22 languages. We compare three modeling paradigms: multilingual encoder fine-tuning, translation-based transfer learning, and prompting-based generative reasoning. For the multi-label categorization task, we introduce a two-stage cascaded architecture to mitigate false positives under severe class imbalance. Our results show that multilingual encoders achieve the most robust performance for binary detection, whereas reasoning-based prompting is competitive for fine-grained category classification. This comparative study highlights the strengths and limitations of each paradigm for cross-lingual polarization analysis.
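The abstract does not spell out what the two stages of the cascade are. One plausible reading, assumed here, is that a binary polarization detector gates a multi-label category classifier, so that category labels are only emitted for texts the first stage flags. The function name, thresholds, and category names below are hypothetical.

```python
def cascade_predict(p_polarized, category_probs, gate=0.5, thresholds=None):
    """Two-stage decision: gate on the binary score, then threshold each category.

    p_polarized    -- stage-1 probability that the text is polarized
    category_probs -- stage-2 mapping from category name to probability
    gate           -- stage-1 cut-off (assumed value)
    thresholds     -- optional per-category cut-offs (default 0.5 each)
    """
    if thresholds is None:
        thresholds = {name: 0.5 for name in category_probs}
    # Stage 1: texts judged non-polarized receive no category labels at all,
    # which suppresses false positives on the rare categories.
    if p_polarized < gate:
        return []
    # Stage 2: independent per-category thresholds (multi-label output).
    return sorted(
        name for name, p in category_probs.items() if p >= thresholds[name]
    )

# Hypothetical category names and scores, for illustration only.
scores = {"political": 0.9, "religious": 0.3, "other": 0.55}
gated_out = cascade_predict(0.2, scores)
passed = cascade_predict(0.8, scores)
```

Under this reading, the gate trades a little recall for precision: a category classifier that fires on a neutral text is overruled by the first stage.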
The rapid advancement of Large Language Models (LLMs) has significantly impacted software engineering, posing challenges for determining the origin and authenticity of source code. This paper presents the MALTO team’s submission for SemEval-2026 Task 13, explicitly focusing on Subtask B (Authorship Attribution among 11 classes) and Subtask C (Hybrid Code Detection). To address severe class imbalance and the complex boundaries of mixed human-machine code, we propose a unified framework that leverages an ensemble of UniXcoder and CodeT5. Our approach integrates a robust Tree-sitter-based Universal Canonicalization strategy, Data Augmentation, and a novel 3-Phase Curriculum Training schedule enhanced by Hard Negative Mining. Specifically, UniXcoder’s cross-modal representations excel at distinguishing among semantically overlapping LLM families (Subtask B), whereas CodeT5’s identifier-aware architecture is superior at detecting subtle structural anomalies in hybrid and adversarial snippets (Subtask C). By aggregating these complementary strengths, our soft-voting ensemble overcomes the limitations of individual models, demonstrating strong robustness against imbalanced distributions and effectively discriminating between purely human, purely machine, hybrid, and adversarial code snippets.
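Soft voting, as mentioned in the abstract above, averages the class probabilities of the individual models and predicts the class with the highest mean. The sketch below is a minimal illustration, not the MALTO implementation: the function name, the optional weights, and the probability values are assumptions, and the four class names follow the Subtask C description in the abstract.

```python
def soft_vote(prob_rows, weights=None):
    """Weighted average of per-model class probabilities, then argmax."""
    if weights is None:
        weights = [1.0] * len(prob_rows)  # equal weighting by default
    total_weight = sum(weights)
    n_classes = len(prob_rows[0])
    averaged = [
        sum(w * row[c] for w, row in zip(weights, prob_rows)) / total_weight
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=averaged.__getitem__)
    return label, averaged

classes = ["human", "machine", "hybrid", "adversarial"]

# Made-up probabilities for a single snippet from the two ensemble members.
unixcoder_probs = [0.40, 0.35, 0.20, 0.05]
codet5_probs = [0.10, 0.20, 0.60, 0.10]

label, averaged = soft_vote([unixcoder_probs, codet5_probs])
predicted = classes[label]
```

In this toy case the first model narrowly prefers "human" while the second is confident in "hybrid"; averaging lets the more confident model decide, which is the main difference from a hard majority vote.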

2025

Large language models (LLMs) may retain and reproduce sensitive information learned during training, posing significant privacy and ethical concerns. Once detected, this personal information should be deleted from the model. A naive solution would be to retrain the model from scratch whenever a deletion is needed. However, this is infeasible given the immense computational, economic, and environmental costs of training these models. For this reason, Machine Unlearning (MU) has emerged in recent years as a field of research on efficiently deleting specific information from a model’s knowledge. This paper presents our solution to the “Unlearning sensitive content from Large Language Models” shared task at SemEval-2025, which challenges researchers to develop effective LLM MU techniques. We adopt a Dual-Teacher framework that leverages a Competent and an Incompetent Teacher to erase unwanted information while selectively preserving model utility. Our approach adapts established computer vision unlearning methods to the sequential nature of language models through KL divergence minimization over next-token prediction probabilities. Our experimental results demonstrate that our method outperforms state-of-the-art techniques.
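The dual-teacher objective described above can be sketched as a per-sequence loss that pulls the student towards the Competent Teacher on data to retain and towards the Incompetent Teacher on data to forget. The abstract does not give the exact loss, so the KL direction, the averaging over positions, and the uniform distribution standing in for the Incompetent Teacher are assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete next-token distributions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def dual_teacher_loss(student, competent, incompetent, is_forget):
    """Mean per-position KL from the selected teacher to the student.

    Each argument is a list of next-token distributions, one per position.
    Forget sequences are matched to the incompetent teacher, retain
    sequences to the competent teacher (assumed formulation).
    """
    teacher = incompetent if is_forget else competent
    per_position = [kl_divergence(t, s) for t, s in zip(teacher, student)]
    return sum(per_position) / len(per_position)

# Toy vocabulary of three tokens and a sequence of two positions.
competent = [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10]]
incompetent = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]  # knows nothing
student = [row[:] for row in competent]  # student starts as a copy of the original model

retain_loss = dual_teacher_loss(student, competent, incompetent, is_forget=False)
forget_loss = dual_teacher_loss(student, competent, incompetent, is_forget=True)
```

Before any unlearning step the retain loss is zero and the forget loss is positive, so gradient updates on this objective would change the student only where it still reproduces the content to be forgotten.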
Large language models (LLMs) often produce hallucinations, factually incorrect statements that appear highly persuasive. These errors pose risks in fields like healthcare, law, and journalism. This paper presents our approach to the Mu-SHROOM shared task at SemEval 2025, which challenges researchers to detect hallucination spans in LLM outputs. We introduce a new method that combines probability-based analysis with Natural Language Inference to evaluate hallucinations at the word level. Our technique aims to better align with human judgments while working independently of the underlying model. Our experimental results demonstrate the effectiveness of this method compared to existing baselines.

2024

In Natural Language Generation (NLG), contemporary Large Language Models (LLMs) face several challenges, such as generating fluent yet inaccurate outputs and relying on fluency-centric metrics. This often leads to neural networks exhibiting “hallucinations.” The SHROOM challenge focuses on automatically identifying these hallucinations in the generated text. To tackle these issues, we introduce two key components: a data augmentation pipeline incorporating LLM-assisted pseudo-labelling and sentence rephrasing, and a voting ensemble of three models pre-trained on Natural Language Inference (NLI) tasks and fine-tuned on diverse datasets.