Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Sara Rosenthal, Aiala Rosá, Debanjan Ghosh, Marcos Zampieri (Editors)

Anthology ID: 2025.semeval-1
Month: July
Year: 2025
Address: Vienna, Austria
Venues: SemEval | WS
Publisher: Association for Computational Linguistics
URL: https://preview.aclanthology.org/corrections-2025-08/2025.semeval-1/
ISBN: 979-8-89176-273-2
PDF: https://preview.aclanthology.org/corrections-2025-08/2025.semeval-1.pdf

VerbaNexAI at SemEval-2025 Task 9: Advances and Challenges in the Automatic Detection of Food Hazards
Andrea Menco Tovar | Juan Martinez Santos | Edwin Puertas

Ensuring food safety requires effective detection of potential hazards in food products. This paper presents the participation of VerbaNexAI in the SemEval-2025 Task 9 challenge, which focuses on the automatic identification and classification of food hazards from descriptive texts. Our approach employs a machine learning-based strategy, leveraging a Random Forest classifier combined with TF-IDF vectorization and character n-grams (n=2-5) to enhance linguistic pattern recognition. The system achieved competitive performance in hazard and product classification tasks, obtaining notable macro and micro F1 scores. However, we identified challenges such as handling underrepresented categories and improving generalization in multilingual contexts. Our findings highlight the need to refine preprocessing techniques and model architectures to enhance food hazard detection. We made the source code publicly available to encourage reproducibility and collaboration in future research.
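
The feature extraction named in this abstract (TF-IDF over character n-grams with n=2-5) can be illustrated with a minimal pure-Python sketch. The function names and the smoothed IDF formula are assumptions chosen for illustration, not the authors' implementation, and the Random Forest classifier that would consume these vectors is omitted.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=5):
    """Return every character n-gram of length n_min..n_max (lowercased)."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_vectors(docs):
    """Build one sparse TF-IDF dict per document.

    Uses raw term frequency and a smoothed IDF, log((1 + N) / (1 + df)) + 1,
    which is one common variant (an assumption, not taken from the paper).
    """
    grams = [Counter(char_ngrams(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for counts in grams for g in counts)
    idf = {g: math.log((1 + n_docs) / (1 + df[g])) + 1.0 for g in df}
    return [{g: tf * idf[g] for g, tf in counts.items()} for counts in grams]


# Hypothetical incident titles, used only to exercise the functions.
vectors = tfidf_vectors(["salmonella in eggs", "glass in jar"])
```

N-grams that occur in every document (such as "in" here) receive the minimum IDF, while n-grams specific to one title are weighted more heavily.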

REFIND at SemEval-2025 Task 3: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models
Donggeon Lee | Hwanjo Yu

REFIND is a retrieval-augmented framework for detecting hallucinated spans in LLM outputs by leveraging retrieved documents. It introduces Context Sensitivity Ratio, a metric quantifying LLM sensitivity to evidence. REFIND outperforms baselines across nine languages, including low-resource settings, achieving superior hallucination detection accuracy. These results demonstrate the effectiveness of context sensitivity quantification in improving hallucination detection.

CTYUN-AI at SemEval-2025 Task 1: Learning to Rank for Idiomatic Expressions
Yuming Fan | Dongming Yang | Zefeng Cai | Binghuai Lin

We propose a multimodal framework integrating textual context and image caption analysis via systematic data augmentation and parameter-efficient fine-tuning. Our approach features: (1) option shuffling to eliminate positional bias, (2) lexical augmentation through synonym replacement and back-translation, and (3) optimized cross-modal ranking adaptation. The system ranks first in Portuguese (Top-1 Acc: 0.92) and second in English (Top-1 Acc: 0.87) on CodaBench. Experiments across 7B-72B models reveal 32B architectures achieve optimal capacity-trainability balance, while larger 72B models suffer from overfitting. Results demonstrate the limitations of GPT-4 knowledge distillation and emphasize controlled data augmentation for idiomatic language learning, advancing multimodal figurative language processing techniques.
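
Option shuffling, the first augmentation listed in this abstract, can be sketched as follows. This is a minimal illustration under the assumption that each training instance has a list of candidate options and one correct index; the function name and data layout are hypothetical.

```python
import random


def shuffle_options(options, correct_index, rng):
    """Shuffle candidate options and return (shuffled, new_correct_index).

    Re-sampling the order for every training copy keeps the correct
    answer from being tied to a fixed position.
    """
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct option moved to wherever its old index landed in `order`.
    return shuffled, order.index(correct_index)


# Hypothetical captions for one idiom instance; index 2 is the correct one.
rng = random.Random(0)
captions = ["caption A", "caption B", "caption C", "caption D", "caption E"]
shuffled, new_index = shuffle_options(captions, 2, rng)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs.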

JNLP at SemEval-2025 Task 11: Cross-Lingual Multi-Label Emotion Detection Using Generative Models
Jieying Xue | Phuong Nguyen | Minh Nguyen | Xin Liu

With the rapid advancement of global digitalization, users from different countries increasingly rely on social media for information exchange. In this context, multilingual multi-label emotion detection has emerged as a critical research area. This study addresses SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. Our paper focuses on two sub-tracks of this task: (1) Track A: Multi-label emotion detection, and (2) Track B: Emotion intensity. To tackle multilingual challenges, we leverage pre-trained multilingual models and focus on two architectures: (1) a fine-tuned BERT-based classification model and (2) an instruction-tuned generative LLM. Additionally, we propose two methods for handling multi-label classification: the Base method, which maps an input directly to all its corresponding emotion labels, and the Pairwise method, which models the relationship between the input text and each emotion category individually. Experimental results demonstrate the strong generalization ability of our approach in multilingual emotion recognition. In Track A, our method achieved Top 4 performance across 10 languages, ranking 1st in Hindi. In Track B, our approach also secured Top 5 performance in 7 languages, highlighting its simplicity and effectiveness.

YNU-HPCC at SemEval-2025 Task3: Leveraging Zero-Shot Learning for Hallucination Detection
Shen Chen | Jin Wang | Xuejie Zhang

This study reports the YNU-HPCC team’s participation in SemEval-2025 shared task 3, which focuses on detecting hallucination spans in multilingual instruction-tuned LLM outputs. This task differs from typical hallucination detection tasks in that it does not require identifying the entire response or pinpointing which sentences contain hallucinations generated by the LLM. Instead, the task focuses on detecting hallucinations at the character level. In addition, this task differs from typical hallucination detection based on binary classification. It requires not only identifying hallucinations but also assigning a likelihood score to indicate how likely each part of the model output is hallucinatory. Our approach combines Retrieval-Augmented Generation (RAG) and zero-shot methods, guiding LLMs to detect and extract hallucination spans using external knowledge. The proposed system achieved first place in Chinese and fifteenth place in English for track3.

AtlasIA at SemEval-2025 Task 11: FastText-Based Emotion Detection in Moroccan Arabic for Low-Resource Settings
Abdeljalil El Majjodi | Imane Momayiz | Nouamane Tazi

This study addresses multi-label emotion classification in Moroccan Arabic. We developed a lightweight computational approach to detect and categorize emotional content in seven distinct categories: anger, fear, joy, disgust, sadness, surprise, and neutral. Our findings reveal that our efficient, subword-aware model achieves 46.44% accuracy on the task, demonstrating the viability of lightweight approaches for emotion recognition in under-resourced language variants. The model’s performance, while modest, establishes a baseline for emotion detection in Moroccan Arabic, highlighting both the potential and challenges of applying computationally efficient architectures to dialectal Arabic processing. Our analysis reveals particular strengths in handling morphological variations and out-of-vocabulary words, though challenges persist in managing code-switching and subtle emotional distinctions. These results offer valuable insights into the trade-offs between speed and accuracy in multilingual emotion detection systems, particularly for low-resource languages.

Irapuarani at SemEval-2025 Task 10: Evaluating Strategies Combining Small and Large Language Models for Multilingual Narrative Detection
Gabriel Assis | Lívia De Azevedo | Joao De Moraes | Laura Ribeiro | Aline Paes

This paper presents the Irapuarani team’s participation in SemEval-2025 Task 10, Subtask 2, which focuses on hierarchical multi-label classification of narratives from online news articles. We explored three distinct strategies: (1) a direct classification approach using a multilingual Small Language Model (SLM), disregarding the hierarchical structure; (2) a translation-based strategy where texts from multiple languages were translated into a single language using a Large Language Model (LLM), followed by classification with a monolingual SLM; and (3) a hybrid strategy leveraging an SLM to filter domains and an LLM to assign labels while accounting for the hierarchy. We conducted experiments on datasets in all available languages, namely Bulgarian, English, Hindi, Portuguese and Russian. Our results show that Strategy 2 is the most generalizable across languages, achieving test set rankings of 21st in English, 9th in Portuguese and Russian, 7th in Bulgarian, and 10th in Hindi.

Domain_adaptation at SemEval-2025 Task 11: Adversarial Domain Adaptation for Text-based Emotion Recognition
Mikhail Lepekhin | Serge Sharoff

We report our participation in the SemEval-2025 shared task on classification of emotions and describe our solutions with BERT-based models and their ensembles. We participate in tracks A and B. We apply and compare base XLM-RoBERTa and Adversarial Domain Adaptation (ADA) on XLM-RoBERTa with text length as the adversarial feature. As a simple baseline we also use Logistic Regression based on TF-IDF features. We show that ADA increases the macro F1 score on the low-resource languages and on shorter texts. In addition, we describe our approach to tracks A and C, where we use ADA with the text language as the confounder. We show that for some languages it helps to improve the F1 score. In all the tracks we work with the following languages: Russian, Amharic, Algerian Arabic, German, English, Spanish, Hausa, Brazilian Portuguese, Romanian, Ukrainian.

IRNLP at SemEval-2025 Task 10: Multilingual Narrative Characterization and Classification
Panagiotis Kiousis

Our system for multilingual narrative classification is based on XLM-RoBERTa Large and other BERT-based models (e.g., DeepPavlov, Neuralmind BERT), fine-tuned on different language datasets. To improve generalization and ensure robust performance across languages, we employed a repeated k-fold cross-validation strategy. This allowed us to maximize the use of available training data while mitigating potential overfitting issues. Our preprocessing pipeline included (1) language-specific tokenization, (2) hierarchical label structuring, and (3) dynamic batch sampling to balance label distributions. We optimized the model using the F1 macro and F1 samples metrics, ensuring that the system’s predictions were well-calibrated for fine-grained multilingual classification. The results demonstrated that our approach effectively leveraged transformer-based architectures to model complex narrative structures across languages, with strong performance gains due to repeated k-fold evaluation.

NotMyNarrative at SemEval-2025 Task 10: Do Narrative Features Share Across Languages in Multilingual Encoder Models?
Geraud Faye | Guillaume Gadek | Wassila Ouerdane | Celine Hudelot | Sylvain Gatepaille

Narratives are a new tool to propagate ideas that are sometimes well hidden in press articles. The SemEval-2025 Task 10 focuses on detecting and extracting such narratives in multiple languages. In this paper, we explore the capabilities of encoder-based language models to classify texts according to the narrative they contain. We show that multilingual encoders outperform monolingual models on this dataset, which is challenging due to the small number of samples per class per language. We perform additional experiments to measure the generalization of features in multilingual models to new languages.

keepitsimple at SemEval-2025 Task 3: LLM-Uncertainty based Approach for Multilingual Hallucination Span Detection
Saketh Vemula | Parameswari Krishnamurthy

Identification of hallucination spans in black-box language model generated text is essential for applications in the real world. A recent attempt in this direction is SemEval-2025 Task 3, Mu-SHROOM, a Multilingual Shared Task on Hallucinations and Related Observable Over-generation Errors. In this work, we present our solution to this problem, which capitalizes on the variability of stochastically-sampled responses in order to identify hallucinated spans. Our hypothesis is that if a language model is certain of a fact, its sampled responses will be uniform, while hallucinated facts will yield different and conflicting results. We measure this divergence through entropy-based analysis, allowing for accurate identification of hallucinated segments. Our method is not dependent on additional training and hence is cost-effective and adaptable. In addition, we conduct extensive hyperparameter tuning and perform error analysis, giving us crucial insights into model behavior and where improvement can be made.
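
The hypothesis in this abstract (consistent samples suggest a known fact, conflicting samples suggest hallucination) can be illustrated with Shannon entropy over sampled answers. This is only a sketch: it assumes each sample has already been reduced to a comparable answer string, and the function name is hypothetical. The paper's span-level procedure is not reproduced here.

```python
import math
from collections import Counter


def response_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples.

    0.0 means every sampled response agreed; larger values mean the
    responses conflict more, which the abstract treats as a hallucination cue.
    """
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical sampled answers to the same question.
consistent = response_entropy(["Paris", "Paris", "Paris", "Paris"])
conflicting = response_entropy(["1912", "1915", "1921", "1908"])
```

A threshold on this value (a tunable hyperparameter, not specified here) would then separate likely-hallucinated content from stable content.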

Team A at SemEval-2025 Task 11: Breaking Language Barriers in Emotion Detection with Multilingual Models
P Sam Sahil | Anupam Jamatia

This paper describes the system submitted by Team A to SemEval 2025 Task 11, “Bridging the Gap in Text-Based Emotion Detection.” The task involved identifying the perceived emotion of a speaker from text snippets, with each instance annotated with one of six emotions: joy, sadness, fear, anger, surprise, or disgust. A dataset provided by the task organizers served as the foundation for training and evaluating our models. Among the various approaches explored, the best performance was achieved using multilingual embeddings combined with a fully connected layer. This paper details the system architecture, discusses experimental results, and highlights the advantages of leveraging multilingual representations for robust emotion detection in text.

YNU-HPCC at SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Using Multiple Prediction Headers
Hao Yang | Jin Wang | Xuejie Zhang

This paper describes our team’s participation in Subtask A of Task 11 at SemEval-2025, focusing on multilingual text-based emotion classification. The team employed the RoBERTa model, enhanced with modifications to the output head to allow independent prediction of six emotions: anger, disgust, fear, joy, sadness, and surprise. The dataset was translated into English using Google Translate to facilitate processing. The study found that a single prediction head outperformed simultaneous prediction of multiple emotions, and training on the translated dataset yielded better results than using the original dataset. The team incorporated Focal Loss and R-Drop techniques to address class imbalance and improve model stability. Future work will continue to explore improvements in this area.
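
For readers unfamiliar with Focal Loss, the standard binary form for a single emotion label can be written as a short sketch. The default values of `gamma` and `alpha` below are commonly used settings and are assumptions, not values reported in the abstract.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one label.

    p: predicted probability of the positive class, strictly between 0 and 1.
    y: gold label, 1 or 0.
    The (1 - p_t) ** gamma factor shrinks the loss of easy, well-classified
    examples so that rare or hard labels dominate the gradient.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)   # confident and correct: heavily down-weighted
hard = focal_loss(0.30, 1)   # positive example the model is missing
```

With `gamma=0` and `alpha=0.5` the expression reduces to half of the ordinary cross-entropy, which is a quick sanity check on any implementation.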

I2R-NLP at SemEval-2025 Task 8: Question Answering on Tabular Data
Yuze Gao | Bin Chen | Jian Su

We present a Large Language Model (LLM) based system for question answering (QA) over tabular data that leverages multi-turn prompting to automatically generate executable Pandas functions. Our framework decomposes the problem into three key steps: (1) Answer Type Identification, where the system identifies the expected format of the response (e.g., boolean, number, category); (2) Pandas Function Generation, which generates a corresponding Pandas function using table metadata and in-context examples; and (3) Error Correction and Regeneration, which iteratively refines the function based on error feedback from executions. Evaluations on the SemEval-2025 Task 8 Tabular QA benchmark (Grijalba et al., 2024) demonstrate that our multi-turn approach significantly outperforms single-turn prompting models in exact match accuracy by 7.3%. The proposed system not only improves code generation robustness but also paves the way for enhanced adaptability in table-QA reasoning tasks. Our implementation is available at https://github.com/Gyyz/Question_Answering-over-Tabular-Data.
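
The third step, regenerating a function from execution errors, follows a generic generate-execute-repair loop that can be sketched as below. Everything here is illustrative: `run_with_repair` and `fake_llm` are hypothetical names, the LLM is replaced by a stub, and a plain dict stands in for a Pandas DataFrame so the sketch runs without dependencies.

```python
def run_with_repair(generate, table, max_turns=3):
    """Ask `generate` for code defining answer(table), retrying on errors.

    generate(error) returns Python source; `error` is None on the first
    turn and the previous failure message on later turns.
    """
    error = None
    for _ in range(max_turns):
        source = generate(error)
        namespace = {}
        try:
            # NOTE: exec runs arbitrary code; a real system would sandbox it.
            exec(source, namespace)
            return namespace["answer"](table)
        except Exception as exc:
            # Feed the error text back so the next turn can correct it.
            error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"no working function after {max_turns} turns: {error}")


def fake_llm(error):
    """Stub standing in for an LLM: the first draft has a column-name typo."""
    if error is None:
        return ("def answer(table):\n"
                "    return sum(table['price']) / len(table['prices'])")
    return ("def answer(table):\n"
            "    return sum(table['price']) / len(table['price'])")


result = run_with_repair(fake_llm, {"price": [2.0, 4.0]})
```

The first draft fails with a `KeyError`, the error text is passed back, and the second draft returns the mean price.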

This paper presents the approach we employed in SemEval-2025 Task 11: “Bridging the Gap in Text-Based Emotion Detection.” The core objective of this shared task is emotion perception, focusing on determining the emotion the speaker is likely expressing when uttering a sentence or short text fragment, as perceived by the majority. In this task, we applied a prompt optimization strategy based on in-context learning, combined with data augmentation and ensemble voting techniques, to significantly enhance the model’s performance. Through these optimizations, the model demonstrated improved accuracy and stability in emotion detection. Ultimately, in both Track A (Multi-label Emotion Detection) and Track B (Emotion Intensity Prediction), our approach achieved top-3 rankings across multiple languages, showcasing the effectiveness and cross-lingual adaptability of our method.

This paper presents the OZemi team’s submission to SemEval-2025 Task 11: Multilingual Emotion Detection and Intensity. Our approach prioritized computational efficiency, leveraging lightweight models that achieved competitive results even for low-resource languages. We addressed data imbalance through data augmentation techniques such as back translation and class balancing. Our system utilized multilingual BERT and machine translation to enhance performance across 35 languages. Despite ranking mid-tier overall, our results demonstrate that relatively simple models can yield adequate performance across diverse linguistic settings. We provide an error analysis of emotion classification challenges, particularly for nuanced expressions such as sarcasm and irony, and discuss the impact of emoji representation on model predictions. Finally, we outline future directions, including improvements in sentiment intensity modeling and the integration of semantic prosody to refine emotion detection.

In this work, we tackle the challenge of multi-label emotion classification, where a sentence can simultaneously express multiple emotions. This task is particularly difficult due to the overlapping nature of emotions and the limited context available in short texts. To address these challenges, we propose an ensemble approach that integrates Pre-trained Language Models (BERT-based models) and Large Language Models, each capturing distinct emotional cues within the text. The predictions from these models are aggregated through a voting mechanism, enhancing classification accuracy. Additionally, we incorporate threshold optimization and class weighting techniques to mitigate class imbalance. Our method demonstrates substantial improvements over baseline models. Our approach ranked 4th out of 90 on the English leaderboard and exhibited strong performance in English in SemEval-2025 Task 11 Track A.

ChuenSumi at SemEval-2025 Task 1: Sentence Transformer Models and Processing Idiomacity
Sumiko Teng | Chuen Shin Yong

This paper describes our participation in Task 1 of SemEval-2025, specifically Subtask A’s English Text-Only track, where we develop a model to rank text descriptions of images with respect to how well they represent the use of a given multi-word expression in its respective context sentence. We trained sentence transformer models from Hugging Face to rank the text descriptions, finding the RoBERTa model to be the better performing model. For the final evaluation, the fine-tuned RoBERTa model achieved an accuracy of 0.4 for the first developer’s evaluation set, and 0.2 for the second, ranking 9th in the English Text Only category for Subtask A. Overall, our results show that a vanilla sentence transformer approach performs adequately in the task and in processing idioms. They also suggest that RoBERTa models may be stronger in idiom processing than other models.

daalft at SemEval-2025 Task 1: Multi-step Zero-shot Multimodal Idiomaticity Ranking
David Alfter

This paper presents a multi-step zero-shot system for SemEval-2025 Task 1 on Advancing Multimodal Idiomaticity Representation (AdMIRe). The system employs two state-of-the-art multimodal language models, Claude Sonnet 3.5 and OpenAI GPT-4o, to determine idiomaticity and rank images for relevance in both subtasks. A hybrid approach combining o1-preview for idiomaticity classification and GPT-4o for visual ranking produced the best overall results. The system demonstrates competitive performance on the English extended dataset for Subtask A, but faces challenges in cross-lingual transfer to Portuguese. Comparing Image+Text and Text-Only approaches reveals interesting trends and raises questions about the role of visual information in multimodal idiomaticity detection.

Anastasia at SemEval-2025 Task 9: Subtask 1, Ensemble Learning with Data Augmentation and Focal Loss for Food Risk Classification.
Tung Le | Tri Ngo | Trung Dang

Our approach for the SemEval-2025 Task 9: Subtask 1, The Food Hazard Detection Challenge showcases a robust ensemble learning methodology designed to classify food hazards and associated products from incident report titles. By incorporating advanced data augmentation techniques, we significantly enhanced model generalization and addressed class imbalance through the application of focal loss. This strategic combination led to our team securing the Top 1 position, achieving an impressive score of 0.8223, underscoring the strength of our solution in improving classification performance for food safety risk assessment.

GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification
Iknoor Singh | Carolina Scarton | Kalina Bontcheva

The proliferation of online news and the increasing spread of misinformation necessitate robust methods for automated narrative classification. This paper presents our approach to SemEval 2025 Task 10 Subtask 2, which aims to classify news articles into a predefined two-level taxonomy of main narratives and sub-narratives across multiple languages. We propose Hierarchical Three-Step Prompting (H3Prompt) for multilingual narrative classification. Our methodology follows a three-step prompting strategy, where the model first categorises an article into one of two domains (Ukraine-Russia War or Climate Change), then identifies the most relevant main narratives, and finally assigns sub-narratives. Our approach secured the top position on the English test set among 28 competing teams worldwide. This result highlights the effectiveness of our method in improving narrative classification performance over the baselines.

DKE-Research at SemEval-2025 Task 7: A Unified Multilingual Framework for Cross-Lingual and Monolingual Retrieval with Efficient Language-specific Adaptation
Yuqi Wang | Kangshi Wang

This paper presents a unified framework for fact-checked claim retrieval, integrating contrastive learning with an in-batch multiple negative ranking loss and a conflict-aware batch sampler to enhance query-document alignment across languages. Additionally, we introduce language-specific adapters for efficient fine-tuning, enabling adaptation to previously unseen languages.
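
The in-batch multiple negatives ranking loss mentioned in this abstract treats, for each query, its paired document as the positive and every other document in the batch as a negative. The sketch below uses plain lists and dot-product scores; the `scale` parameter and function names are assumptions, and the conflict-aware batch sampler is not shown.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mnr_loss(queries, docs, scale=1.0):
    """Mean cross-entropy where docs[i] is the positive for queries[i].

    All other documents in the batch act as negatives for that query.
    """
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * dot(q, d) for d in docs]
        # Numerically stable log-sum-exp over the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(queries)


# Hypothetical 2-d embeddings: posts as queries, fact-checks as documents.
posts = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(posts, [[1.0, 0.0], [0.0, 1.0]], scale=10.0)
misaligned = mnr_loss(posts, [[0.0, 1.0], [1.0, 0.0]], scale=10.0)
```

Because other batch items serve as negatives, two near-duplicate fact-checks in the same batch would be penalized as if one were wrong, which is plausibly the kind of conflict a batch sampler is meant to avoid.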

wangkongqiang at SemEval-2025 Task 11:Bridging the Gap in Text-Based Emotion Detection
Wang Kongqiang

This paper presents our system developed for SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection, Track A: Multi-label Emotion Detection. Given a target text snippet, the task is to predict the perceived emotion(s) of the speaker, that is, to decide whether each of the following emotions applies: joy, sadness, fear, anger, surprise, or disgust. To this end, we focus on English source language selection strategies with four different pre-trained language models: google-bert, FacebookAI-roberta, dccuchile-bert and distilbert-multi. Our experiments involve (1) visually analyzing the training data, (2) training multiple single models on the training set, and (3) combining multiple single models through weighted-voting ensemble learning. We further study the influence of different hyperparameters on the ensemble model and select the best ensemble for prediction on the test set. Our submission achieved a good ranking on the test set, with an emotion macro F1 score of 0.6998 and an emotion micro F1 score of 0.7374. Although the organizers use the macro F1 score for the final ranking, our approach still yielded good results.

UNEDTeam at SemEval-2025 Task 10: Zero-Shot Narrative Classification
Jesus M. Fraile-Hernandez | Anselmo Peñas

In this paper we present our participation in Subtask 2 of SemEval-2025 Task 10, focusing on the identification and classification of narratives in news of multiple languages, on climate change and the Ukraine-Russia war. To address this task, we employed a Zero-Shot approach using a generative Large Language Model without prior training on the dataset. Our classification strategy is based on two steps: first, the system classifies the topic of each news item; subsequently, it identifies the sub-narratives directly at the finer granularity. We present a detailed analysis of the performance of our system compared to the best ranked systems on the leaderboard, highlighting the strengths and limitations of our approach.

We propose a multilingual text processing framework that combines multilingual translation with data augmentation, QLoRA-based multi-model fine-tuning, and GLM-4-Plus-based ensemble classification. By using GLM-4-Plus to translate multilingual texts into English, we enhance data diversity and quantity. Data augmentation effectively improves the model’s performance on imbalanced datasets. QLoRA fine-tuning optimizes the model and reduces classification loss. GLM-4-Plus, as a meta-classifier, further enhances system performance. Our system achieved first place in three languages (English, Portuguese and Russian).

This paper presents our research in the SemEval-2025 Task 9: Food Hazard Detection Challenge, with a focus on the application of ModernBERT for food safety data classification. We applied the ModernBERT model for the food hazard classification task, achieving a score of 0.7952 on the validation set and 0.7729 on the final test set, outperforming other models. Through comparative experiments with various deep learning architectures, we further confirmed the superiority of ModernBERT in food hazard detection. The results demonstrate the significant potential of ModernBERT in food safety management, providing strong support for its practical applications in the field. The code of this paper is available at: https://github.com/daojiaxu/semeval_2025_Task-9.

TechSSN3 at SemEval-2025 Task 11: Multi-Label Emotion Detection Using Ensemble Transformer Models and Lexical Rules
Vishal S | Rajalakshmi Sivanaiah | Angel Deborah S

Transformer models, specifically BERT-Large Uncased, DeBERTa, and RoBERTa, are first employed to classify the dataset, with their hyperparameters being fine-tuned to identify the most effective configuration. These models leverage deep contextual embeddings to capture nuanced semantic and syntactic information, making them powerful for sentiment analysis. However, transformer-based models alone may not fully capture the structural aspects of sentiment-bearing sentences. To address this, part-of-speech (POS) tagging is incorporated using a Hidden Markov Model (HMM) to analyze sentence structure and identify the key words responsible for conveying sentiment. By isolating adjectives, adverbs, and verbs, the lexical sentiment of individual words is determined using a polarity-based scoring method. This lexical score, derived from sentiment lexicons like SentiWordNet, provides an additional layer of interpretability, particularly in cases where transformer models struggle with implicit sentiment cues or negation handling. A key innovation in this approach is the adaptive weighting mechanism used to combine the outputs of the transformer models and lexical scoring. Instead of assigning uniform importance to each method, a unique weight is assigned to each model for every emotion category, ensuring that the best-performing approach contributes more significantly to the final sentiment prediction. For instance, DeBERTa, which excels in contextual understanding, is given more weight for subtle emotions like sadness, whereas lexical scoring is emphasized for emotions heavily influenced by explicit adjectives, such as joy or anger. The weight allocation is determined empirically through performance evaluation on a validation set, ensuring an optimal balance between deep learning-based contextual understanding and rule-based sentiment assessment. Additionally, traditional machine learning models such as Support Vector Machines (SVMs), Decision Trees, and Random Forests are tested for comparative analysis. However, these models demonstrate inferior performance, struggling with capturing deep contextual semantics and handling nuanced expressions of sentiment, reinforcing the superiority of the hybrid transformer + lexical approach. This method not only enhances interpretability but also improves accuracy, particularly in cases where sentiment is influenced by structural elements, negations, or compound expressions. The combined framework ensures a more robust and adaptable sentiment analysis model, effectively balancing data-driven learning and linguistic insights.
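
The adaptive weighting described in this abstract, with one weight per model per emotion, amounts to a per-emotion weighted average of model scores. The sketch below uses made-up scores and weights purely for illustration; the abstract states only that the real weights are chosen empirically on a validation set.

```python
def combine_scores(model_scores, weights):
    """Per-emotion weighted average of model outputs.

    model_scores[model][emotion] is that model's score for the emotion.
    weights[emotion][model] is how much the model is trusted for it.
    """
    combined = {}
    for emotion, per_model in weights.items():
        total_weight = sum(per_model.values())
        combined[emotion] = sum(
            w * model_scores[model][emotion] for model, w in per_model.items()
        ) / total_weight
    return combined


# Hypothetical numbers: DeBERTa weighted up for sadness, lexicon for joy.
scores = {
    "deberta": {"sadness": 0.8, "joy": 0.4},
    "lexicon": {"sadness": 0.2, "joy": 0.9},
}
weights = {
    "sadness": {"deberta": 3.0, "lexicon": 1.0},
    "joy": {"deberta": 1.0, "lexicon": 3.0},
}
final = combine_scores(scores, weights)
```

Normalizing by the total weight keeps each combined score in the same range as the inputs, so a single decision threshold can still be applied per emotion.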

iai_MSU at SemEval-2025 Task-3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes in English
Mikhail Pukemo | Aleksandr Levykin | Dmitrii Melikhov | Gleb Skiba | Roman Ischenko | Konstantin Vorontsov

This paper presents the submissions of the iai_MSU team for SemEval-2025 Task 3 – Mu-SHROOM, where we achieved first place in the English language. The task involves detecting hallucinations in model-generated text, which requires systems to verify claims against reliable sources. In this paper, we present our approach to hallucination detection, which employs a three-stage system. The first stage uses a retrieval-based approach (Lewis et al., 2021) to verify claims against external knowledge sources. The second stage applies Self-Refine Prompting (Madaan et al., 2023) to improve detection accuracy by analyzing potential errors of the first stage. The third stage combines predictions from the first and second stages into an ensemble. Our system achieves state-of-the-art performance on the competition dataset, demonstrating the effectiveness of combining retrieval-augmented verification with Self-Refine Prompting. The code for the solutions is available on https://github.com/pansershrek/IAI_MSU.

CIOL at SemEval-2025 Task 11: Multilingual Pre-trained Model Fusion for Text-based Emotion Recognition
Md. Hoque | Mahfuz Ahmed Anik | Abdur Rahman | Azmine Toushik Wasi

Multilingual emotion detection is a critical challenge in natural language processing, enabling applications in sentiment analysis, mental health monitoring, and user engagement. However, existing models struggle with overlapping emotions, intensity quantification, and cross-lingual adaptation, particularly in low-resource languages. This study addresses these challenges as part of SemEval-2025 Task 11 by leveraging language-specific transformer models for multi-label classification (Track A), intensity prediction (Track B), and cross-lingual generalization (Track C). Our models achieved strong performance in Russian (Track A: 0.848 F1, Track B: 0.8594 F1) due to emotion-rich pretraining, while Chinese (0.483 F1) and Spanish (0.6848 F1) struggled with intensity estimation. Track C faced significant cross-lingual adaptation issues, with Russian (0.3102 F1), Chinese (0.2992 F1), and Indian (0.2613 F1) highlighting challenges in low-resource settings. Despite these limitations, our findings provide valuable insights into multilingual emotion detection. Future work should enhance cross-lingual representations, address data scarcity, and integrate multimodal information for improved generalization and real-world applicability.

UWBa at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval
Ladislav Lenc | Daniel Cífka | Jiri Martinek | Jakub Šmíd | Pavel Kral

This paper presents a zero-shot system for fact-checked claim retrieval. We employed several state-of-the-art large language models to obtain text embeddings. The models were then combined to obtain the best possible result. Our approach achieved 7th place in monolingual and 9th in cross-lingual subtasks. We used only English translations as an input to the text embedding models since multilingual models did not achieve satisfactory results. We identified the most relevant claims for each post by leveraging the embeddings and measuring cosine similarity. Overall, the best results were obtained by the NVIDIA NV-Embed-v2 model. For some languages, we benefited from model combinations (NV-Embed & GPT or Mistral).
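The retrieval step described above (ranking fact-checked claims by cosine similarity between post and claim embeddings) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `cosine_similarity` and `rank_claims`, the toy two-dimensional vectors, and the `top_k` parameter are all hypothetical, and real embeddings would come from a text embedding model such as those named in the abstract.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_claims(post_embedding, claim_embeddings, top_k=3):
    """Return the top_k (claim_id, score) pairs, most similar first.

    claim_embeddings is a hypothetical dict mapping claim ids to vectors.
    """
    scored = [
        (claim_id, cosine_similarity(post_embedding, embedding))
        for claim_id, embedding in claim_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real model embeddings (assumption).
claims = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]}
top = rank_claims([1.0, 0.1], claims, top_k=2)
```

How the scores of several embedding models were combined is not stated in the abstract, so the sketch covers a single model only.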

The proliferation of structured tabular data in domains like healthcare and finance has intensified the demand for precise table question answering, particularly for complex numerical reasoning and cross-domain generalization. Existing approaches struggle with implicit semantics and multi-step arithmetic operations. This paper presents our solution for the SemEval-2025 task, comprising three synergistic components: (1) a Schema Profiler that extracts structural metadata via LLM-driven analysis and statistical validation, (2) a Hierarchical Chain-of-Thought module that decomposes questions into four stages (semantic anchoring, schema mapping, query synthesis, and self-correction) to ensure SQL validity, and (3) a Confidence-Accuracy Voting mechanism that resolves discrepancies across LLMs through weighted ensemble decisions. Our framework achieves scores of 81.23 on Databench and 81.99 on Databench_lite, ranking 6th and 5th respectively, demonstrating the effectiveness of structured metadata guidance and cross-model deliberation in complex TableQA scenarios.

pdf bib abs
DataBees at SemEval-2025 Task 11: Challenges and Limitations in Multi-Label Emotion Detection
Sowmya Anand | Tanisha Sriram | Rajalakshmi Sivanaiah | Angel Deborah S | Mirnalinee Thankanadar

Text-based emotion detection is crucial in NLP, with applications in sentiment analysis, social media monitoring, and human-computer interaction. This paper presents our approach to the Multi-label Emotion Detection challenge, classifying texts into joy, sadness, anger, fear, and surprise. We experimented with traditional machine learning and transformer-based models, but results were suboptimal: F1 scores of 0.3723 (English), 0.5174 (German), and 0.6957 (Spanish). We analyze the impact of preprocessing, model selection, and dataset characteristics, highlighting key challenges in multi-label emotion classification and potential improvements.

pdf bib abs
QM-AI at SemEval-2025 Task 6: an Ensemble of BERT Models for Promise Identification in ESG Context
Zihang Sun | Filip Sobczak

This paper presents our approach and findings in SemEval-2025 Task 6: Multinational, Multilingual, Multi-industry Promise Verification (PromiseEval), which focuses on verifying promises in industrial Environmental, Social, and Governance (ESG) reports. Specifically, we participate in the first subtask of the PromiseEval shared task, promise identification. We tackle this subtask by building an ensemble of four BERT models trained in different experimental configurations and deploying logistic regression as the meta-model. Each configuration has a different combination of two variables: whether augmented data is used, and whether English translation is used. We find that the BERT model trained without augmented data or English translation not only has the best evaluation results on the test data in most languages, but is also more robust than the meta-model. We submitted results from the meta-model to the leaderboard and ranked first in Japanese and Korean, second in French and Chinese, and seventh in English.

pdf bib
YNU-HPCC at SemEval-2025 Task 7: Multilingual and Cross-lingual Fact-checked Claim Retrieval
Yuheng Mao | Jin Wang | Xuejie Zhang

pdf bib abs
adithjrajeev at SemEval-2025 Task 10: Sequential Learning for Role Classification Using Entity-Centric News Summaries
Adith Rajeev | Radhika Mamidi

Disinformation and manipulative narratives are highly prevalent in online news sources today, and verifying the integrity of this information is vital because online audiences are highly susceptible to such propaganda and disinformation. Verifying online information is, however, a significant challenge. The task Multilingual Characterization and Extraction of Narratives from Online News therefore focuses on developing novel methods for analyzing news ecosystems and detecting manipulation attempts. As part of this effort, we focus on the subtask of Entity Framing, which involves assigning named entities in news articles one of three main roles (Protagonist, Antagonist, and Innocent) with a further fine-grained role distinction. We propose a pipeline that summarizes the article with the summary centered around the entity. The entity and its entity-centric summary are then used as input to a BERT-based classifier that carries out the final role classification. Finally, we experiment with different approaches in the steps of the pipeline and compare their results.

pdf bib abs
xiacui at SemEval-2025 Task 11: Addressing Data Imbalance in Transformer-Based Multi-Label Emotion Detection with Weighted Loss
Xia Cui

This paper explores the application of a simplified weighted loss function to Transformer-based models for multi-label emotion detection in SemEval-2025 Shared Task 11. Our approach addresses data imbalance by dynamically adjusting class weights, thereby enhancing performance on minority emotion classes without the computational burden of traditional resampling methods. We evaluate BERT, RoBERTa, and BART on the BRIGHTER dataset, using evaluation metrics such as Micro F1, Macro F1, ROC-AUC, Accuracy, and Jaccard similarity coefficients. The results demonstrate that the weighted loss function improves performance on high-frequency emotion classes but shows limited impact on minority classes. These findings underscore both the effectiveness and the challenges of applying this approach to imbalanced multi-label emotion detection.
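To make the idea of a class-weighted loss for imbalanced multi-label data concrete, the following is a minimal sketch of a weighted binary cross-entropy. It is an illustration under stated assumptions, not the paper's loss: the weighting rule (negatives divided by positives per class, computed once from the training labels) and the names `positive_class_weights` and `weighted_bce` are hypothetical, and the abstract's "dynamic" adjustment may differ.

```python
import math


def positive_class_weights(label_matrix):
    """Per-class weight = (#negative examples) / (#positive examples).

    Assumed weighting rule for illustration; rare classes get weights > 1.
    Classes with no positive example fall back to a weight of 1.0.
    """
    num_examples = len(label_matrix)
    num_classes = len(label_matrix[0])
    weights = []
    for c in range(num_classes):
        positives = sum(row[c] for row in label_matrix)
        negatives = num_examples - positives
        weights.append(negatives / positives if positives > 0 else 1.0)
    return weights


def weighted_bce(probs, targets, pos_weights, eps=1e-12):
    """Mean binary cross-entropy over classes, with the positive term scaled per class."""
    total = 0.0
    for p, t, w in zip(probs, targets, pos_weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(w * t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


# Toy label matrix: class 0 is frequent, class 1 is rare (assumption).
train_labels = [[1, 0], [1, 0], [1, 1], [0, 0]]
weights = positive_class_weights(train_labels)
loss = weighted_bce([0.5, 0.5], [1, 1], [1.0, 3.0])
```

With this rule, a missed positive for the rare class costs three times as much as one for a balanced class, which is the intended pressure toward minority labels without resampling.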

pdf bib abs
UZH at SemEval-2025 Task 3: Token-Level Self-Consistency for Hallucination Detection
Michelle Wastl | Jannis Vamvas | Rico Sennrich

This paper presents our system developed for SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. The objective of this task is to identify spans of hallucinated text in the output of large language models across 14 high- and low-resource languages. To address this challenge, we propose two consistency-based approaches: (a) token-level consistency with a superior LLM and (b) token-level self-consistency with the underlying model of the sequence that is to be evaluated. Our results show effectiveness when compared to simple mark-all baselines, and competitiveness with other submissions to the shared task and, for some languages, with GPT4o-mini prompt-based approaches.

pdf bib abs
NCL-UoR at SemEval-2025 Task 3: Detecting Multilingual Hallucination and Related Observable Overgeneration Text Spans with Modified RefChecker and Modified SelfCheckGPT
Jiaying Hong | Thanet Markchom | Jianfei Xu | Tong Wu | Huizhi Liang

SemEval-2025 Task 3 (Mu-SHROOM) focuses on detecting hallucinations in content generated by various large language models (LLMs) across multiple languages. This task involves not only identifying the presence of hallucinations but also pinpointing their specific occurrences. To tackle this challenge, this study introduces two methods: modified RefChecker and modified SelfCheckGPT. The modified RefChecker integrates prompt-based factual verification into References, structuring them as claim-based tests rather than single external knowledge sources. The modified SelfCheckGPT incorporates external knowledge to overcome its reliance on internal knowledge. In addition, both methods’ original prompt designs are enhanced to identify hallucinated words within LLM-generated texts. Experimental results demonstrate the effectiveness of the approach, achieving a high ranking on the test dataset in detecting hallucinations across various languages, with an average IoU of 0.5310 and an average COR of 0.5669.

pdf bib abs
UniBuc at SemEval-2025 Task 9: Similarity Approaches to Classification
Marius Micluta-Campeanu

In this paper, we present a similarity-based method for explainable classification in the context of SemEval-2025 Task 9: The Food Hazard Detection Challenge. Our proposed system is essentially unsupervised, leveraging the semantic properties of the labels. This approach brings some key advantages over typical classification systems. First, similarity metrics offer a more intuitive interpretation. Next, this technique allows for inference on novel labels. Finally, there is a non-negligible number of ambiguous labels, so learning a direct mapping does not lead to meaningful representations. Our team ranks 13th for the second sub-task among participants that used only the title and the text as features. Our method is generic and can be applied to any classification task.

pdf bib abs
UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation
Thanet Markchom | Tong Wu | Liting Huang | Huizhi Liang

SemEval-2025 Task 1 focuses on ranking images based on their alignment with a given nominal compound that may carry idiomatic meaning in both English and Brazilian Portuguese. To address this challenge, this work uses generative large language models (LLMs) and multilingual CLIP models to enhance idiomatic compound representations. LLMs generate idiomatic meanings for potentially idiomatic compounds, enriching their semantic interpretation. These meanings are then encoded using multilingual CLIP models, serving as representations for image ranking. Contrastive learning and data augmentation techniques are applied to fine-tune these embeddings for improved performance. Experimental results show that multimodal representations extracted through this method outperformed those based solely on the original nominal compounds. The fine-tuning approach shows promising outcomes but is less effective than using embeddings without fine-tuning.

pdf bib abs
NlpUned at SemEval-2025 Task 10: Beyond Training: A Taxonomy-Guided Approach to Role Classification Using LLMs
Alberto Caballero | Alvaro Rodrigo | Roberto Centeno

The paper presents a taxonomy-guided approach to role classification in news articles using Large Language Models (LLMs). Instead of traditional model training, the system employs zero-shot and few-shot prompting strategies, leveraging structured taxonomies and contextual cues for classification. The study evaluates hierarchical and single-step classification approaches, finding that a unified, single-step model with contextual preprocessing achieves the best performance. The research underscores the importance of input structuring and classification strategy in optimizing LLM performance for real-world applications.

pdf bib abs
tinaal at SemEval-2025 Task 11: Enhancing Perceived Emotion Intensity Prediction with Boosting Fine-Tuned Transformers
Ting Zhu | Liting Huang | Huizhi (Elly) Liang

This paper presents a framework for perceived emotion intensity prediction, focusing on SemEval-2025 Task 11 Track B. The task involves predicting the intensity of five perceived emotions—anger, fear, joy, sadness, and surprise—on an ordinal scale from 0 (no emotion) to 3 (high emotion). Our approach builds upon our method introduced in the WASSA workshop and enhances it by integrating ModernBERT in place of the traditional BERT model within a boosting-based ensemble framework. To address the difficulty in capturing fine-grained emotional distinctions, we incorporate class-preserving mixup data augmentation, a custom Pearson CombinLoss function, and fine-tuned transformer models, including ModernBERT, RoBERTa, and DeBERTa. Compared to individual fine-tuned transformer models (BERT, RoBERTa, DeBERTa and ModernBERT) without augmentation or ensemble learning, our approach demonstrates significant improvements. The proposed system achieves an average Pearson correlation coefficient of 0.768 on the test set, outperforming the best individual baseline model. In particular, the model performs best for sadness (r = 0.808) and surprise (r = 0.770), highlighting its ability to capture subtle intensity variations in the text. Despite these improvements, challenges such as data imbalance, performance on low-resource emotions (e.g., anger and fear), and the need for refined data augmentation techniques remain open for future research.

pdf bib abs
NCL-AR at SemEval-2025 Task 7: A Sieve Filtering Approach to Refute the Misinformation within Harmful Social Media Posts
Alex Robertson | Huizhi (Elly) Liang

In this paper, we propose a sieve filtering-based approach that can retrieve facts to invalidate claims made in social media posts. The fact filters are initially coarse-grained, based on the original language of the social media posts, and end with fine-grained filters based on the exact time frame in which the posts were uploaded online. This streamlined approach achieved a 0.883 retrieval success rate in the monolingual task while remaining scalable, processing one social media post every 0.07 seconds.

pdf bib abs
XLM-Muriel at SemEval-2025 Task 11: Hard Parameter Sharing for Multi-lingual Multi-label Emotion Detection
Pouya Hosseinzadeh | Mohammad Mehdi Ebadzadeh | Hossein Zeinali

In this paper, we present our system developed for SemEval-2025 Task 11: Bridging the Gap in Text-based Emotion Detection, Track A. Our architecture uses a pretrained encoder model as the shared part of the model and adds a language-specific head to adapt the shared part to each language. In the first part of this report, we introduce the task and the specific track in which we participated, and then elaborate on the dataset and the system we developed to handle the task. Finally, we analyze our results and discuss the limitations and potential strengths of our solution that could be leveraged in future work to improve results on similar tasks.

pdf bib abs
Chinchunmei at SemEval-2025 Task 11: Boosting the Large Language Model’s Capability of Emotion Perception using Contrastive Learning
Tian Li | Yujian Sun | Huizhi (Elly) Liang

SemEval-2025 Task 11, Bridging the Gap in Text-Based Emotion Detection, introduces an emotion recognition challenge spanning over 28 languages. This competition encourages researchers to explore more advanced approaches to address the challenges posed by the diversity of emotional expressions and background variations. It features two tracks: multi-label classification (Track A) and emotion intensity prediction (Track B), covering six emotion categories: anger, fear, joy, sadness, surprise, and disgust. In our work, we systematically explore the benefits of two contrastive learning approaches: sample-based (Contrastive Reasoning Calibration) and generation-based (DPO, SimPO) contrastive learning. The sample-based contrastive approach trains the model by comparing two samples to generate more reliable predictions. The generation-based contrastive approach trains the model to differentiate between correct and incorrect generations, refining its prediction. All models are fine-tuned from LLaMa3-Instruct-8B. Our system achieves 12th place in Track A and 7th place in Track B for English, while ranking among the top-tier performing systems for other languages.

pdf bib abs
UIMP-Aaman at SemEval-2025 Task 11: Detecting Intensity and Emotion in Social Media and News
Aisha Aman-Parveen

This paper presents our participation in SemEval-2025 Task 11, which consists of emotion recognition in sentences written in multiple languages. We use in-context learning and fine-tuning methods to teach LLMs how to predict labels for Track A, Track B and Track C. The best-performing method depends on the track and the language being predicted.

pdf bib abs
CSIRO-LT at SemEval-2025 Task 11: Adapting LLMs for Emotion Recognition for Multiple Languages
Jiyu Chen | Necva Bölücü | Sarvnaz Karimi | Diego Molla | Cecile Paris

Detecting emotions across different languages is challenging due to the varied and culturally nuanced ways in which emotions are expressed. The SemEval 2025 Task 11: Bridging the Gap in Text-Based Emotion Detection shared task was organised to investigate emotion recognition across different languages. The goal of the task is to implement an emotion recogniser that can identify the basic emotional states that general third-party observers would attribute to an author based on their written text snippet, along with the intensity of those emotions. We report our investigation of various task-adaptation strategies for LLMs in emotion recognition. We show that the most effective method for this task is to fine-tune a pre-trained multilingual LLM for each language.

pdf bib abs
OseiBrefo-Liang at SemEval-2025 Task 8: A Multi-Agent LLM code generation approach for answering Tabular Questions
Emmanuel Osei-Brefo | Huizhi (Elly) Liang

This paper presents a novel multi-agent framework for automated code generation and execution in tabular question answering. Developed for SemEval-2025 Task 8, our system utilises a structured, multi-agent approach where distinct agents handle dataset extraction, schema identification, prompt engineering, code generation, execution, and prediction. Unlike traditional methods such as semantic parsing-based SQL generation and transformer-based table models such as TAPAS, our approach leverages a large language model-driven code synthesis pipeline using the DeepSeek API. Our system follows a zero-shot inference approach, which generates Python functions that operate directly on structured data. By dynamically extracting the dataset schema and integrating it into structured prompts, we enhance the model’s comprehension of tabular structures, which leads to more precise and interpretable results. Experimental results demonstrate that our system outperforms existing tabular question answering models, achieving an accuracy of 84.67% on DataBench and 86.02% on DataBench-lite, significantly surpassing TAPAS (2.68%) and stable-code-3b-GGUF (27%). The source code used in this paper is available at https://github.com/oseibrefo/semEval25task8

pdf bib abs
INFOTEC-NLP at SemEval-2025 Task 11: A Case Study on Transformer-Based Models and Bag of Words
Emmanuel Santos-Rodriguez | Mario Graff

Leveraging transformer-based models as feature extractors, we introduce a hybrid architecture that integrates a bidirectional LSTM network with a multi-head attention mechanism to address the challenges of multilingual emotion detection in text. While pre-trained transformers provide robust contextual embeddings, they often struggle with capturing long-range dependencies and handling class imbalances, particularly in low-resource languages. To mitigate these issues, our approach combines sequential modeling and attention mechanisms, allowing the model to refine representations by emphasizing key emotional cues in text.

pdf bib abs
sonrobok4 Team at SemEval-2025 Task 8: Question Answering over Tabular Data Using Pandas and Large Language Models
Nguyen Son | Dang Thin

This paper describes the system of the sonrobok4 team for SemEval-2025 Task 8: DataBench, Question-Answering over Tabular Data. The task requires answering questions based on the given question and dataset ID, ensuring that the responses are derived solely from the provided table. We address this task by using large language models (LLMs) to translate natural language questions into executable Python code for querying Pandas DataFrames. Furthermore, we employ techniques such as a rerun mechanism for error handling, structured metadata extraction, and dataset preprocessing to enhance performance. Our best-performing system achieved 89.46% accuracy on Subtask 1 and placed in the top 4 on the private test set. Additionally, it achieved 85.25% accuracy on Subtask 2 and placed in the top 9. We mainly focus on Subtask 1. We analyze the effectiveness of different LLMs for structured data reasoning and discuss key challenges in tabular question answering.

This paper introduces DUTIR831’s approach to SemEval-2025 Task 5, which focuses on generating relevant subjects from the Integrated Authority File (GND) for tagging multilingual technical records in the TIBKAT database. To address challenges in understanding the hierarchical GND taxonomy and automating subject assignment, a three-stage approach is proposed: (1) a data synthesis stage that utilizes an LLM to generate and selectively filter high-quality data, (2) a model training module that leverages LLMs and various training strategies to acquire GND knowledge and refine TIBKAT preferences, and (3) a subject terms completion mechanism consisting of multi-sampling ranking, subject terms extraction using an LLM, vector-based model retrieval, and various re-ranking strategies. The quantitative evaluation results show that our system is ranked 2nd in the all-subject datasets and 4th in the tib-core-subjects datasets. The qualitative evaluation results show that the system is ranked 2nd in the tib-core-subjects datasets.

This paper presents our system for Subtask 10 of Entity Framing, which focuses on assigning one or more hierarchical roles to named entities in news articles. Our approach iteratively refines prompts and utilizes the Entity-Centric Chain of Thought to complete the task. Specifically, to minimize ambiguity in label definitions, we use the model’s predictions as supervisory signals, iteratively refining the category definitions. Furthermore, to minimize the interference of irrelevant information during inference, we incorporate entity-related information into the CoT framework, allowing the model to focus more effectively on entity-centric reasoning. Our system ranked first on the leaderboard for Russian main role classification and second for English, with accuracies of 0.8645 and 0.9362, respectively. We discuss the impact of several components of our multilingual classification approach, highlighting their effectiveness.

pdf bib abs
NCL-NLP at SemEval-2025 Task 11: Using Prompting engineering framework and Low Rank Adaptation of Large Language Models for Multi-label Emotion Detection
Kun Lu

This paper presents a prompt engineering framework to further improve the performance of generative models on the multi-label classification task released as SemEval-2025 Task 11 Track A. The task is to predict the presence of all emotions contained in a text segment, namely joy, fear, anger, surprise, and sadness. A generative large language model fine-tuned with instructions can accomplish multi-label classification to a certain extent; however, there is still room for improvement in its correctness and accuracy. To address these problems, we propose a prompt engineering framework that further enhances performance while following the instruction fine-tuning format to generate the model’s responses. Compared to fine-tuning with simple instructions, our system improved the overall macro F1 score by 0.3, with a significant improvement in the accuracy of each individual category. The system performed well in the final ranking. Nevertheless, it still has certain issues, as the results of local validation may differ from the official competition results. This could be due to the training samples being insufficient and unbalanced. The system could therefore be improved further through feature engineering and other data enhancement methods.

pdf bib abs
BERTastic at SemEval-2025 Task 10: State-of-the-Art Accuracy in Coarse-Grained Entity Framing for Hindi News
Tarek Mahmoud | Zhuohan Xie | Preslav Nakov

We describe our system for SemEval-2025 Task 10 Subtask 1 on coarse-grained entity framing in Hindi news, exploring two complementary strategies. First, we experiment with LLM prompting using GPT-4o, comparing hierarchical multi-step prompting with native single-step prompting for both main and fine-grained role prediction. Second, we conduct an extensive study on fine-tuning XLM-R, analyzing different context granularities (full article, paragraph, or sentence-level entity mentions), monolingual vs. multilingual settings, and main vs. fine-grained role labels. Our best system, trained on fine-grained role annotations across languages using sentence-level context, achieved 43.99% exact match, 56.56% precision, 47.38% recall, and 51.57% F1-score. Notably, our system set a new state-of-the-art for main role prediction on Hindi news, achieving 78.48% accuracy and outperforming the next best model at 76.90%, as per the official leaderboard. Our findings highlight effective strategies for entity framing in multilingual and low-resource settings.

pdf bib abs
Heimerdinger at SemEval-2025 Task 11: A Multi-Agent Framework for Perceived Emotion Detection in Multilingual Text
Zeliang Tong | Zhuojun Ding | Yingjia Li

This paper presents our system developed for the SemEval-2025 Task 11: Text-Based Emotion Detection (TBED) task, which aims to identify the emotions perceived by the majority of people from a speaker’s short text. We introduce a multi-agent framework for emotion recognition, comprising two key agents: the Emotion Perception Profiler, which identifies emotions in text, and the Intensity Perception Profiler, which assesses the intensity of those emotions. We model the task using both generative and discriminative approaches, leveraging BERT series and large-scale generative language models (LLMs). A multi-system collaboration mechanism is employed to further enhance the accuracy, stability, and robustness. Additionally, we incorporate cross-lingual knowledge transfer to improve performance in diverse linguistic scenarios. Our method demonstrates superior results in emotion detection and intensity prediction across multiple subtasks, highlighting its effectiveness, especially in language adaptability.

pdf bib abs
Cyber for AI at SemEval-2025 Task 4: Forgotten but Not Lost: The Balancing Act of Selective Unlearning in Large Language Models
Dinesh Srivasthav P | Bala Mallikarjunarao Garlapati

Large Language Models (LLMs) face significant challenges in maintaining privacy, ethics, and compliance, when sensitive or obsolete data must be selectively removed. Retraining these models from scratch is computationally infeasible, necessitating efficient alternatives. As part of the SemEval 2025 Task 4, this work focuses on the application of selective unlearning in LLMs to address this challenge. In this paper, we present our experiments and findings, primarily leveraging global weight modification to achieve an equilibrium between effectiveness of unlearning, knowledge retention, and target model’s post-unlearning utility. We also detail the task-specific evaluation mechanism, results, and challenges. Our algorithms have achieved an aggregate score of 0.409 and 0.389 on the test set for 7B and 1B target models, respectively, demonstrating promising results in verifiable LLM unlearning.

pdf bib abs
NCLTeam at SemEval-2025 Task 10: Enhancing Multilingual, multi-class, and Multi-Label Document Classification via Contrastive Learning Augmented Cascaded UNet and Embedding based Approaches
Shu Li | George Williamson | Huizhi Liang

SemEval 2025 Task 10 Subtask 2 presents a multi-task, multi-label text classification challenge. The task requires systems to classify documents simultaneously across three distinct topics: Climate Change (CC), the Ukraine-Russia War (URW), and others. Several challenges were identified, including the intrinsic differences between topics, the imbalance of categories, the insufficient number of samples, and the differing distributions of the development and test sets. To address these challenges, two approaches were implemented. The first is the Contrastive learning augmented Cascaded UNet model (CCU), which employs a cascaded architecture to jointly process all subtasks. This model incorporates a UNet-style architecture to classify embeddings extracted by the base text encoder. A domain adaptation method was implemented to facilitate joint learning across different document topics. We address data insufficiency through contrastive learning and mitigate data imbalance using an asymmetric loss function. We also implemented a shallow machine learning approach, in which transformer encoder models extract text embeddings from various aspects and machine learning methods then perform the classification; the results are compared with the baseline. The UNet-style model achieves the highest F1 samples score among our approaches, 0.365 on the test set, placing 5th on the leaderboard. The source code developed for this paper is available at

pdf bib abs
DUTtask10 at SemEval-2025 Task 10: ThoughtFlow: Hierarchical Narrative Classification via Stepwise Prompting
Du Py | Huayang Li | Liang Yang | Zhang Shaowu

This paper describes our system for SemEval-2025 Task 10: Hierarchical Narrative Classification. We propose a two-step hierarchical approach that combines generative reasoning and fine-tuning for sub-narrative classification. The main techniques of our system are: 1) leveraging a large pre-trained model to generate a reasoning process for better context understanding, 2) fine-tuning the model for precise sub-narrative categorization, 3) using a multi-label classification strategy for more accurate sub-narrative identification, and 4) incorporating data augmentation to increase the diversity and robustness of the training data. Our system ranked 1st in Subtask 2 for Hindi, achieving an F1 macro coarse score of 0.56900 and an F1 samples score of 0.53500. The results demonstrate the effectiveness of our approach in classifying narratives and sub-narratives in a multilingual setting, with the additional benefit of enhanced model performance through data augmentation.

pdf bib abs
Lotus at SemEval-2025 Task 11: RoBERTa with Llama-3 Generated Explanations for Multi-Label Emotion Classification
Niloofar Ranjbar | Hamed Baghbani

This paper presents a novel approach for multi-label emotion detection, where Llama-3 is used to generate explanatory content that clarifies ambiguous emotional expressions, thereby enhancing RoBERTa’s emotion classification performance. By incorporating explanatory context, our method improves F1-scores, particularly for emotions like fear, joy, and sadness, and outperforms text-only models. The addition of explanatory content helps resolve ambiguity, addresses challenges like overlapping emotional cues, and enhances multi-label classification, marking a significant advancement in emotion detection tasks.

pdf bib abs
LTG at SemEval-2025 Task 10: Optimizing Context for Classification of Narrative Roles
Egil Rønningstad | Gaurav Negi

Our contribution to SemEval shared task 10, subtask 1 on entity framing, tackles the challenge of providing the necessary segments from longer documents as context for classification with a masked language model. We show that simple entity-oriented heuristics for context selection combined with the XLM-RoBERTa language model are on par with, or outperform, Supervised Fine-Tuning with larger generative language models.

pdf bib abs
CCNU at SemEval-2025 Task 3: Leveraging Internal and External Knowledge of Large Language Models for Multilingual Hallucination Annotation
Xu Liu | Guanyi Chen

We present the system developed by the Central China Normal University (CCNU) team for the Mu-SHROOM shared task, which focuses on identifying hallucinations in question-answering systems across 14 different languages. Our approach leverages multiple Large Language Models (LLMs) with distinct areas of expertise, employing them in parallel to annotate hallucinations, effectively simulating a crowdsourcing annotation process. Furthermore, each LLM-based annotator integrates both internal and external knowledge related to the input during the annotation process. Using the open-source LLM DeepSeek-V3, our system achieves the top ranking (#1) for Hindi data and secures a Top-5 position in seven other languages. In this paper, we also discuss unsuccessful approaches explored during our development process and share key insights gained from participating in this shared task.

pdf bib abs
ClaimCatchers at SemEval-2025 Task 7: Sentence Transformers for Claim Retrieval
Rrubaa Panchendrarajan | Rafael Frade | Arkaitz Zubiaga

Retrieving previously fact-checked claims from verified databases has become a crucial area of research in automated fact-checking, given the impracticality of manual verification of massive online content. To address this challenge, SemEval 2025 Task 7 focuses on multilingual previously fact-checked claim retrieval. This paper presents the experiments conducted for this task, evaluating the effectiveness of various sentence transformer models—ranging from 22M to 9B parameters—in conjunction with retrieval strategies such as nearest neighbor search and reranking techniques. Further, we explore the impact of learning context-specific text representation via finetuning these models. Our results demonstrate that smaller and medium-sized models, when optimized with effective finetuning and reranking, can achieve retrieval accuracy comparable to larger models, highlighting their potential for scalable and efficient misinformation detection.

pdf bib abs
NEKO at SemEval-2025 Task 4: A Gradient Ascent Based Machine Unlearning Strategy
Chi Kuan Lai | Yifei Chen

The power and wide application of large language models (LLMs) have raised concerns about their risk of leaking private or sensitive information. However, retraining the models is expensive and impractical, which motivates machine unlearning: removing specific information from language models while preserving general utility. Task 4 at SemEval 2025 is a shared task with this exact objective. We present an approach that combines gradient ascent-based forgetting with Kullback-Leibler (KL) divergence-based retention, applied to a 1-billion-parameter causal language model. Despite achieving effective forgetting, the system struggles to maintain model utility. Our experiments reveal a critical trade-off between unlearning effectiveness and performance preservation, highlighting challenges in practical machine unlearning implementations.
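
As a rough illustration of the trade-off this abstract describes, the sketch below combines a negated cross-entropy term on a forget example (gradient ascent) with a KL term that keeps retain-set predictions close to a frozen reference model. The function names, the weight `lam`, the direction of the KL term, and the toy distributions are assumptions made for illustration, not the authors' implementation.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target token under `probs`."""
    return -math.log(probs[target_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def unlearning_loss(forget_probs, forget_target,
                    ref_retain_probs, cur_retain_probs, lam=1.0):
    # Minimising the negated cross-entropy is gradient ascent on the forget example.
    forget_term = -cross_entropy(forget_probs, forget_target)
    # The KL term penalises drift from the frozen reference model on retain data.
    retain_term = kl_divergence(ref_retain_probs, cur_retain_probs)
    return forget_term + lam * retain_term
```

A larger `lam` favours retention over forgetting, which is one way to picture the utility trade-off reported above.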

pdf bib abs
Team Unibuc - NLP at SemEval-2025 Task 11: Few-shot text-based emotion detection
Claudiu Creanga | Teodor-George Marchitan | Liviu Dinu

This paper describes the approach of the Unibuc - NLP team in tackling the SemEval 2025 Workshop, Task 11: Bridging the Gap in Text-Based Emotion Detection. We mainly focused on experiments using large language models (Gemini, Qwen, DeepSeek) with either few-shot prompting or fine-tuning. With our final system, for the multi-label emotion detection track (track A), we obtained an F1-macro of 0.7546 (26/96 teams) for the English subset, 0.1727 (35/36 teams) for the Portuguese (Mozambican) subset and 0.325 (1/31 teams) for the Emakhuwa subset.

pdf bib abs
YNU-HPCC at SemEval-2025 Task 2: Local Cache and Online Retrieval-Based method for Entity-Aware Machine Translation
Hao Li | Jin Wang | Xuejie Zhang

This paper presents methods for SemEval-2025 Task 11 on text-based emotion detection across three tracks: Multi-label Emotion Detection, Emotion Intensity Prediction, and Cross-lingual Emotion Detection. We apply approaches such as supervised fine-tuning, preference-based reinforcement learning, and few-shot learning to enhance performance. Our combined strategies result in improved accuracy, particularly in multi-label and cross-lingual emotion detection, demonstrating the effectiveness of these methods in diverse linguistic settings.

pdf bib abs
TechSSN3 at SemEval-2025 Task 9: Food Hazard and Product Detection - Category Identification and Vector Prediction
Rajalakshmi Sivanaiah | Karpagavalli S | Karthikeyan S | Krithika C

Food safety is a critical global concern, and timely detection of food-related hazards is essential for public health and economic stability. The automated detection of food hazards from textual data can enhance food safety monitoring by enabling early identification of potential risks. In the Food Hazard Detection task, we address two key challenges: (ST1) food hazard-category and product-category classification and (ST2) food hazard and product vector detection. For ST1, we employ BertForSequenceClassification, leveraging its powerful contextual understanding for accurate food hazard classification. For ST2, we utilize a Random Forest Classifier, which effectively captures patterns in the extracted features for food hazard and product vector detection. This paper presents the results of the TechSSN3 team at the SemEval-2025 Food Hazard Detection Task.

pdf bib abs
MRT at SemEval-2025 Task 8: Maximizing Recovery from Tables with Multiple Steps
Maximiliano Hormazábal Lagos | Álvaro Bueno Sáez | Héctor Cerezo-Costas | Pedro Alonso Doval | Jorge Alcalde Vesteiro

In this paper we present our approach to the SemEval 2025 Task 8: Question-Answering over Tabular Data challenge. Our strategy leverages Python code generation with LLMs to interact with the table and obtain the answers to the questions. The process is composed of multiple steps: understanding the content of the table, generating natural language instructions in the form of steps to follow in order to get the answer, translating these instructions to code, running it and handling potential errors or exceptions. These steps use open-source LLMs and fine-grained, optimized prompts for each task (step). With this approach, we achieved a score of 70.50% for subtask 1.

pdf bib abs
CYUT at SemEval-2025 Task 6: Prompting with Precision – ESG Analysis via Structured Prompts
Shih-Hung Wu | Zhi-Hong Lin | Ping-Hsuan Lee

In response to the increasing need for efficient ESG verification, we propose an innovative NLP framework that automates the evaluation of corporate sustainability claims. Our method integrates Retrieval-Augmented Generation, Chain-of-Thought reasoning, and structured prompt engineering to effectively process and classify diverse, multilingual ESG disclosures. Evaluated under the SemEval-2025 PromiseEval competition, our system achieved top-tier performance, securing first place on the public English leaderboard, excelling in the French track, and delivering marked improvements over conventional machine learning approaches. These results highlight the framework’s potential to offer a scalable, transparent, and robust solution for corporate ESG assessment.

pdf bib abs
Angeliki Linardatou at SemEval-2025 Task 11: Multi-label Emotion Detection
Angeliki Linardatou | Paraskevi Platanou

This study, competing in SemEval 2025 Task 11 - Track A, detects anger, surprise, joy, fear, and sadness. We propose a hybrid approach combining fine-tuned BERT transformers, TF-IDF for lexical analysis, and a Voting Classifier (Logistic Regression, Random Forest, SVM, KNN, XGBoost, LightGBM, CatBoost), with grid search optimizing thresholds. Our model achieves a macro F1-score of 0.6864. Challenges include irony, ambiguity, and label imbalance. Future work will explore larger transformers, data augmentation, and cross-lingual adaptation. This research underscores the benefits of hybrid models, showing that combining deep learning with traditional NLP improves multi-label emotion detection.

pdf bib abs
YNWA_PZ at SemEval-2025 Task 11: Multilingual Multi-Label Emotion Classification
Mohammad Sadegh Poulaei | Mohammad Erfan Zare | Mohammad Reza Mohammadi | Sauleh Eetemadi

This paper explores multilingual emotion classification across binary classification, intensity estimation, and cross-lingual detection tasks. To address linguistic variability and limited annotated data, we evaluate various deep learning approaches, including transformer-based embeddings and traditional classifiers. After extensive experimentation, language-specific embedding models were selected as the final approach, given their superior ability to capture linguistic and cultural nuances. Experiments on high- and low-resource languages demonstrate that this method significantly improves performance, achieving competitive macro-average F1 scores. Notably, in languages such as Tigrinya and Kinyarwanda, our approach achieved a second-place ranking on the cross-lingual detection task, driven by the incorporation of advanced preprocessing techniques. Despite these advances, challenges remain due to limited annotated data in underrepresented languages and the complexity of nuanced emotional expressions. The study highlights the need for robust, language-aware emotion recognition systems and emphasizes future directions, including expanding multilingual datasets and refining models.

pdf bib abs
DUTir at SemEval-2025 Task 4: Optimized Fine-Tuning of Linear Layers for Balanced Knowledge Forgetting and Retention
Zekun Wang | Jingjie Zeng | Yingxu Li | Liang Yang | Hongfei Lin

This paper describes our system used in SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models. In this work, we propose a method for controlling the fine-tuning of a model’s linear layers, referred to as CTL-Finetune (Control-Tuned Linear Fine-tuning). The goal of our method is to allow the model to forget specific information while preserving the knowledge it needs to retain. The method consists of four main components: 1) shuffling data labels, 2) shuffling label gradient calculation, 3) determination of control layers, and 4) fine-tuning using a combination of gradient ascent and gradient descent. Experimental results demonstrate that our approach effectively enables the model to forget targeted knowledge while minimizing the impact on retained information, thus maintaining the model’s overall performance.

pdf bib abs
Zuifeng at SemEval-2025 Task 9: Multitask Learning with Fine-Tuned RoBERTa for Food Hazard Detection
Dapeng Sun | Sensen Li | Yike Wang | Shaowu Zhang

This paper describes our system used in the SemEval-2025 Task 9 Food Hazard Detection Challenge. Data processing that removes elements, combined with a shared multi-task architecture, improves detection performance. Without complex architectural modifications, the proposed method achieves competitive performance, with a 0.7835 macro F1-score on subtask 1 and a 0.4712 macro F1-score on subtask 2. Comparative experiments reveal that joint prediction outperforms separate task training by 1.3% F1-score, showing the effectiveness of multi-task learning for this challenge.

pdf bib abs
IEGPS-CSIC at SemEval-2025 Task 11: BERT-based approach for Multi-label Emotion Detection in English and Russian texts
Albina Sarymsakova | Patricia Martin-Rodilla

This paper presents an original approach for SemEval 2025 Task 11. Our study investigates various strategies to improve performance on the text-based multi-label emotion detection task. Through experiments, we explore the benefits of contextualized vector representations by comparing multiple BERT models, including those specifically trained for emotion recognition. Additionally, we examine the impact of hyperparameter adjustments on model performance. For Subtask A, our approach achieved F1 scores of 0.71 on the English dataset and 0.84 on the Russian dataset. Our findings underscore that (1) monolingual BERT models demonstrate superior performance for English, whereas multilingual BERT models perform better for Russian; (2) pretrained emotion detection models prove less effective for this specific task than models with reduced vocabulary and embeddings focused on specific languages; and (3) exclusive use of BERT-based models, without incorporating additional methods or optimization techniques, demonstrates promising results for multi-label emotion detection.

pdf bib abs
CHILL at SemEval-2025 Task 2: You Can’t Just Throw Entities and Hope—Make Your LLM to Get Them Right
Jaebok Lee | Yonghyun Ryu | Seongmin Park | Yoonjung Choi

In this paper, we describe our approach for the SemEval 2025 Task 2 on Entity-Aware Machine Translation (EA-MT). Our system aims to improve the accuracy of translating named entities by combining two key approaches: Retrieval Augmented Generation (RAG) and iterative self-refinement techniques using Large Language Models (LLMs). A distinctive feature of our system is its self-evaluation mechanism, where the LLM assesses its own translations based on two key criteria: the accuracy of entity translations and overall translation quality. We demonstrate how these methods work together and effectively improve entity handling while maintaining high-quality translations.

pdf bib abs
AlexUNLP-NB at SemEval-2025 Task 1: A Pipeline for Idiom Disambiguation and Visual Representation
Mohamed Badran | Youssof Nawar | Nagwa El-Makky

This paper describes our system developed for SemEval-2025 Task 1, subtask A. This shared subtask focuses on multilingual idiom recognition and the ranking of images based on how well they represent the sense in which a nominal compound is used within a given contextual sentence. This study explores the use of a pipeline, where task-specific models are sequentially employed to address each problem step by step. The process involves three key steps: first, identifying whether idioms are in their literal or figurative form; second, transforming them if necessary; and finally, using the final form to rank the input images.

pdf bib abs
RACAI at SemEval-2025 Task 7: Efficient adaptation of Large Language Models for Multilingual and Crosslingual Fact-Checked Claim Retrieval
Radu-Gabriel Chivereanu | Dan Tufis

The paper details our approach to SemEval 2025 Shared Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval. We investigate how large language models (LLMs) designed for general-purpose retrieval via text-embeddings can be adapted for fact-checked claim retrieval across multiple languages, including scenarios where the query and fact-check are in different languages. The experiments involve fine-tuning with a contrastive objective, resulting in notable gains in both accuracy and efficiency over the baseline retrieval model. We evaluate cost-effective techniques such as LoRA, QLoRA, and Prompt Tuning. Additionally, we demonstrate the benefits of Matryoshka embeddings in minimizing the memory footprint of stored embeddings, reducing the system requirements for a fact-checking system.
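
The memory saving attributed to Matryoshka embeddings comes from storing only a prefix of each vector. A minimal sketch of that idea follows, assuming the embedding model was trained so that leading dimensions remain useful on their own; the function names and the plain-list representation are illustrative and not taken from the paper.

```python
import math

def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine(u, v):
    """Cosine similarity of two unit-length vectors of equal size (a dot product)."""
    return sum(a * b for a, b in zip(u, v))
```

Queries and stored fact-checks would need to be truncated to the same `dim` before comparison.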

pdf bib abs
FII the Best at SemEval 2025 Task 2: Steering State-of-the-art Machine Translation Models with Strategically Engineered Pipelines for Enhanced Entity Translation
Delia-Iustina Grigorita | Tudor-Constantin Pricop | Sergio-Alessandro Suteu | Daniela Gifu | Diana Trandabat

Entity-Aware Machine Translation (EAMT) aims to enhance the accuracy of machine translation (MT) systems in handling named entities, including proper names, domain-specific terms, and structured references. Conventional MT models often struggle to accurately translate these entities, leading to errors that affect comprehension and reliability. In this paper, we present a promising approach for SemEval 2025 Task 2, focusing on improving EAMT in ten target languages. The methodology is based on two complementary strategies: (1) multilingual Named Entity Recognition (NER) and structured knowledge bases for preprocessing and integrating entity translations, and (2) large language models (LLMs) enhanced with optimized prompts and validation mechanisms to improve entity preservation. By combining structured knowledge with neural approaches, this system aims to mitigate entity-related translation errors and enhance the overall performance of MT models. Among the systems that do not use gold information, retrieval-augmented generation (RAG), or fine-tuning, our approach ranked 1st with the second strategy and 3rd with the first strategy.

This paper presents the ZJUKLAB team’s submission for SemEval-2025 Task 4: Unlearning Sensitive Content from Large Language Models. This task aims to selectively erase sensitive knowledge from large language models, avoiding both over-forgetting and under-forgetting issues. We propose an unlearning system that leverages Model Merging (specifically TIES-Merging), combining two specialized models into a more balanced unlearned model. Our system achieves competitive results, ranking second among 26 teams, with an online score of 0.944 for Task Aggregate and 0.487 for overall Aggregate. In this paper, we also conduct local experiments and perform a comprehensive analysis of the unlearning process, examining performance trajectories, loss dynamics, and weight perspectives, along with several supplementary experiments, to understand the effectiveness of our method. Furthermore, we analyze the shortcomings of our method and evaluation metrics, emphasizing that MIA scores and ROUGE-based metrics alone are insufficient to fully evaluate successful unlearning. Finally, we emphasize the need for more comprehensive evaluation methodologies and rethinking of unlearning objectives in future research.

This paper describes our approach to address the SemEval-2025 Task 10 subtask 3, which is focused on narrative extraction given news articles with a dominant narrative. We design an external knowledge injection approach to fine-tune a Flan-T5 model so the generated narrative explanations are in line with the dominant narrative determined in each text. We also incorporate pragmatic information in the form of communicative intentions, using them as external knowledge to assist the model. This ensures that the generated texts align more closely with the intended explanations and effectively convey the expected meaning. The results show that our approach ranks 3rd in the task leaderboard (0.7428 in Macro-F1) with concise and effective news explanations. The analyses highlight the importance of adding pragmatic information when training systems to generate adequate narrative extractions.

pdf bib abs
MRS at SemEval-2025 Task 11: A Hybrid Approach for Bridging the Gap in Text-Based Emotion Detection
Milad Afshari | Richard Frost | Samantha Kissel | Kristen Johnson

We tackle the challenge of multi-label emotion detection in short texts, focusing on SemEval-2025 Task 11 Track A. Our approach, RoEmo, combines generative and discriminative models in an ensemble strategy to classify texts into five emotions: anger, fear, joy, sadness, and surprise. The generative model, instruction-finetuned on emotion detection datasets, undergoes additional fine-tuning on the SemEval-2025 Task 11 Track A dataset to enhance its performance for this specific task. Meanwhile, the discriminative model, based on binary classification, offers a straightforward yet effective approach to classification. We review recent advancements in multi-label emotion detection and analyze the task dataset. Our results show that RoEmo ranks among the top-performing systems, demonstrating high accuracy and reliability.

pdf bib abs
cocoa at SemEval-2025 Task 10: Prompting vs. Fine-Tuning: A Multilevel Approach to Propaganda Classification
Vineet Saravanan | Steven Wilson

The increasing sophistication of natural language processing models has facilitated advancements in hierarchical text classification, particularly in the domain of propaganda detection. This paper presents our submission to SemEval 2025 Task 10, Subtask 1, which focuses on multilevel text classification for identifying and categorizing propaganda narratives in online news. We investigate two primary approaches: (1) prompt-based classification using large language models (LLMs) like GPT, which offers flexibility but struggles with hierarchical categorization, and (2) fine-tuning transformer-based models, where we employ a hierarchical structure: one model classifies the main propaganda category, followed by three separate models specializing in subcategory classification. Our results indicate that while LLMs demonstrate some generalization ability, fine-tuned models significantly outperform them in accuracy and reliability, reinforcing the importance of task-specific supervised learning for propaganda detection. Additionally, we discuss challenges related to data sparsity in subclassification and explore potential enhancements such as multi-task learning and hierarchical loss functions. Our findings contribute to the broader field of automated propaganda detection and emphasize the value of structured classification models in combating misinformation. All code and data used in our experiments will be made publicly available on our GitHub.

pdf bib abs
CICL at SemEval-2025 Task 9: A Pilot Study on Different Machine Learning Models for Food Hazard Detection Challenge
Weiting Wang | Wanzhao Zhang

This paper describes our approaches to SemEval-2025 task 9, a multiclass classification task to detect food hazards and affected products, given food incident reports from web resources. The training data consists of the date of the incidents and the text of the incident reports, as well as the labels: “hazard-category” and “product-category” for task 1, “hazard” and “product” for task 2. We primarily focused on solving task 1 of this challenge. Our approach is in two directions: Firstly, we fine-tuned BERT-based models (BERT and ModernBERT); secondly, in addition to BERT-based models, linearSVC, random forest classifier, and LightGBM were also used to tackle the challenge. From the experiment, we have learned that BERT-based models outperformed the other models mentioned above, and applying focal loss to BERT-based models optimized their performance on imbalanced classification tasks.
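
For readers unfamiliar with focal loss, the sketch below shows the standard per-example form, which scales cross-entropy by `(1 - p_true) ** gamma` so that well-classified examples contribute less. The default `gamma=2.0` and the omission of a class-weighting factor are assumptions; the abstract does not state the settings the authors used.

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example, given the predicted probability of its true class."""
    # With gamma = 0 the modulating factor is 1 and this is plain cross-entropy.
    return -((1.0 - p_true) ** gamma) * math.log(p_true)
```

With `gamma=2.0`, an example predicted at 0.9 has its cross-entropy scaled by about 0.01, while one predicted at 0.1 keeps about 0.81 of it, which is why the loss is often used on imbalanced data.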

pdf bib abs
Hallucination Detectives at SemEval-2025 Task 3: Span-Level Hallucination Detection for LLM-Generated Answers
Passant Elchafei | Mervat Abu-Elkheir

Detecting spans of hallucination in LLM-generated answers is crucial for improving factual consistency. This paper presents a span-level hallucination detection framework for the SemEval-2025 Shared Task, focusing on English and Arabic texts. Our approach integrates Semantic Role Labeling (SRL) to decompose the answer into atomic roles, which are then compared with a retrieved reference context obtained via question-based LLM prompting. Using a DeBERTa-based textual entailment model, we evaluate each role’s semantic alignment with the retrieved context. The entailment scores are further refined through token-level confidence measures derived from output logits, and the combined scores are used to detect hallucinated spans. Experiments on the Mu-SHROOM dataset demonstrate competitive performance. Additionally, hallucinated spans have been verified through fact-checking by prompting GPT-4 and LLaMA. Our findings contribute to improving hallucination detection in LLM-generated responses.

pdf bib abs
iLostTheCode at SemEval-2025 Task 10: Bottom-up Multilevel Classification of Narrative Taxonomies
Lorenzo Concas | Manuela Sanguinetti | Maurizio Atzori

This paper describes the approach used to address the task of narrative classification, which has been proposed as a subtask of Task 10 on Multilingual Characterization and Extraction of Narratives from Online News at the SemEval 2025 campaign. The task consists precisely in assigning all relevant sub-narrative labels from a two-level taxonomy to a given news article in multiple languages (i.e., Bulgarian, English, Hindi, Portuguese and Russian). This involves performing both multi-label and multi-class classification. The model developed for this purpose uses multiple pretrained BERT-based models to create contextualized embeddings that are concatenated and then fed into a simple neural network to compute classification probabilities. Results on the official test set, evaluated using samples F1, range from 0.15 in Hindi (rank #9) to 0.41 in Russian (rank #3). Besides an overview of the system and the results obtained in the task, the paper also includes some additional experiments carried out after the evaluation phase along with a brief discussion of the observed errors.
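
The samples F1 measure mentioned in this abstract averages an F1 score computed separately for each document's label set. A minimal sketch, assuming labels are given as Python sets and that a document with no gold and no predicted labels scores 1.0 (the official scorer may handle that case differently):

```python
def samples_f1(gold_labels, pred_labels):
    """Mean of per-document F1 over paired gold and predicted label sets."""
    scores = []
    for gold, pred in zip(gold_labels, pred_labels):
        if not gold and not pred:
            scores.append(1.0)  # assumption: two empty label sets count as agreement
            continue
        overlap = len(gold & pred)
        # Per-document F1 = 2 * |gold ∩ pred| / (|gold| + |pred|).
        scores.append(2 * overlap / (len(gold) + len(pred)))
    return sum(scores) / len(scores)
```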

pdf bib abs
Pixel Phantoms at SemEval-2025 Task 11: Enhancing Multilingual Emotion Detection with a T5 and mT5-Based Approach
Jithu Morrison S | Janani Hariharakrishnan | Harsh Pratap Singh

Emotion recognition in textual data is a crucial NLP task with applications in sentiment analysis and mental health monitoring. SemEval 2025 Task 11 introduces a multilingual dataset spanning 28 languages, including low-resource ones, to improve cross-lingual emotion detection. Our approach utilizes T5 for English and mT5 for other languages, fine-tuning them for multi-label classification and emotion intensity estimation. Our findings demonstrate the effectiveness of transformer-based models in capturing nuanced emotional expressions across diverse languages.

pdf bib abs
TableWise at SemEval-2025 Task 8: LLM Agents for TabQA
Harsh Bansal | Aman Raj | Akshit Sharma | Parameswari Krishnamurthy

Tabular Question Answering (TabQA) is a challenging task that requires models to comprehend structured tabular data and generate accurate responses based on complex reasoning. In this paper, we present our approach to SemEval Task 8: Tabular Question Answering, where we develop a large language model (LLM)-based agent capable of understanding and reasoning over tabular inputs. Our agent leverages a hybrid retrieval and generation strategy, incorporating structured table parsing, semantic understanding, and reasoning mechanisms to enhance response accuracy. We fine-tune a pre-trained LLM on domain-specific tabular data, integrating chain-of-thought prompting and adaptive decoding to improve multi-hop reasoning over tables. Experimental results demonstrate that our approach achieves competitive performance, effectively handling numerical operations, entity linking, and logical inference. Our findings suggest that LLM-based agents, when properly adapted, can significantly advance the state of the art in tabular question answering.

pdf bib abs
madhans476 at SemEval-2025 Task 9: Multi-Model Ensemble and Prompt-Based Learning for Food Hazard Prediction
Madhan S | Gnanesh R | Gopal D | Sunil Saumya

This paper presents a hybrid approach to food hazard detection for SemEval-2025 Task 9, combining traditional machine learning with advanced language models. For hazard classification (Sub-Task 1), we implemented a novel ensemble system integrating XGBoost with fine-tuned GPT-2 Large and LLaMA 3.1 1B models. For vector detection (Sub-Task 2), we employed a prompt-engineered approach using Flan-T5-XL, highlighting challenges in exact vector matching. Our analysis demonstrates the effectiveness of combining complementary models while revealing opportunities for improvement in rare category detection and extraction precision.

pdf bib abs
UniBuc-AE at SemEval-2025 Task 7: Training Text Embedding Models for Multilingual and Crosslingual Fact-Checked Claim Retrieval
Alexandru Enache

This paper describes our approach to SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval on both the monolingual and crosslingual tracks. Our training methodology for text embedding models combines contrastive pre-training and hard-negative mining in order to fine-tune models from the E5 family. Additionally, we introduce a novel approach for merging the results from multiple models by finding the best weighted majority-vote configuration for each subtask using the validation dataset. Our team ranked 6th in the monolingual track, scoring a 0.934 S@10 averaged over all languages, and achieved a 0.79 S@10 on the crosslingual track, ranking 8th.
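
One way to picture the merging step and the S@10 measure is sketched below: each model adds its weight to every item in its top-k list, and candidate weight tuples are compared by mean success@k on validation queries. The function names, the additive voting rule, and the tie-breaking by identifier are assumptions for illustration; the paper's exact configuration search may differ.

```python
from collections import defaultdict

def success_at_k(ranked_ids, relevant_ids, k=10):
    """1.0 if any relevant item appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in relevant_ids for doc in ranked_ids[:k]) else 0.0

def weighted_vote(rankings, weights, k=10):
    """Merge top-k lists: each model adds its weight to every item it retrieves."""
    scores = defaultdict(float)
    for ranked_ids, weight in zip(rankings, weights):
        for doc in ranked_ids[:k]:
            scores[doc] += weight
    # Sort by descending score; break ties by identifier for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))[:k]

def best_weights(val_queries, candidate_weights, k=10):
    """Pick the weight tuple with the highest mean S@k on validation queries.

    `val_queries` is a list of (rankings, relevant_ids) pairs, one per query.
    """
    def mean_success(weights):
        hits = [success_at_k(weighted_vote(rankings, weights, k), relevant, k)
                for rankings, relevant in val_queries]
        return sum(hits) / len(hits)
    return max(candidate_weights, key=mean_success)
```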

Emotion detection in multilingual settings presents significant challenges, particularly for low-resource languages where labeled datasets are scarce. To address these limitations, we introduce EmoRationale, a Retrieval-Augmented Generation (RAG) framework designed to enhance explainability and cross-lingual generalization in emotion detection. Our approach combines vector-based retrieval with in-context learning in large language models (LLMs), using semantically relevant examples to enhance classification accuracy and interpretability. Unlike traditional fine-tuning methods, our system provides evidence-based reasoning for its predictions, making emotion detection more transparent and adaptable across diverse linguistic contexts. Experimental results on the SemEval-2025 Task 11 dataset demonstrate that our RAG-based method achieves strong performance in multi-label emotion classification, emotion intensity assessment, and cross-lingual emotion transfer, surpassing conventional models in interpretability while remaining cost-effective.

pdf bib abs
STFXNLP at SemEval-2025 Task 11 Track A: Neural Network, Schema, and Next Word Prediction-based Approaches to Perceived Emotion Detection
Noah Murrant | Samantha Brooks | Milton King

In this work, we discuss our models that were applied to SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection (Muhammad et al., 2025). We focused on the English data set of track A, which involves determining what emotions the reader of a snippet of text is feeling. We applied three different types of models that vary in their approaches and reported our findings on the task’s test set. We found that the performance of our models differed from each other, but none of our models outperformed the task’s baseline model.

This paper tackles SemEval 2025 Task 10, “Multilingual Characterization and Extraction of Narratives from Online News,” focusing on the Ukraine-Russia War and Climate Change domains. Our approach covers three subtasks: (1) Entity Framing, assigning protagonist-antagonist-innocent roles with a prompt-based Llama 3.1 (8B) method; (2) Narrative Classification, a multi-label classification using XLM-RoBERTa-base; and (3) Narrative Extraction, generating concise, text-grounded explanations via FLAN-T5. Results show that a unified multilingual transformer pipeline, combined with targeted preprocessing and fine-tuning, achieves substantial gains over baselines while effectively capturing complex narrative structures despite data imbalance and varied label distributions.

pdf bib abs
LATE-GIL-NLP at SemEval-2025 Task 11: Multi-Language Emotion Detection and Intensity Classification Using Transformer Models with Optimized Loss Functions for Imbalanced Data
Jesús Vázquez-Osorio | Helena Gómez-Adorno | Gerardo Sierra | Vladimir Sierra-Casiano | Diana Canchola-Hernández | José Tovar-Cortés | Roberto Solís-Vilchis | Gabriel Salazar

This paper describes our approach to Task 11 (Tracks A and B) at SemEval-2025, which focuses on the challenge of multilingual emotion detection in text, specifically identifying perceived emotions. The task is divided into tracks, and we participated in two of them: Track A, involving multilabel emotion detection, and Track B, which extends this to predicting emotion intensity on an ordinal scale. Addressing the challenges of imbalanced data and linguistic diversity, we propose a robust approach using pre-trained language models, fine-tuned with techniques such as extensive and deep hyperparameter optimization, along with loss function combinations to improve performance on imbalanced datasets and underrepresented languages. Our results demonstrate strong performance on Track A, particularly in low-resource languages such as Tigrinya (ranked 2nd), Igbo (ranked 3rd), and Oromo (ranked 4th). This work offers a scalable framework for emotion detection with applications in cross-cultural communication and human-computer interaction.

pdf bib abs
HTU at SemEval-2025 Task 11: Divide and Conquer - Multi-Label emotion classification using 6 DziriBERTs submodels with Label-fused Iterative Mask Filling technique for low-resource data augmentation.
Abdallah Saleh | Mariam Biltawi

In this paper, the authors address the challenges of multi-label emotion detection in the Algerian dialect by proposing a novel Label-fused Iterative Mask Filling (L-IMF) data augmentation technique combined with a multi-model architecture. The approach leverages DziriBERT, a BERT variant pre-trained on Algerian text, to generate context- and label-sensitive augmented data, mitigating class imbalance while preserving label consistency. The proposed method uses six independent classifiers, each trained on a balanced dataset for a dedicated label, to improve performance. The results show significant improvement on the multi-label classification task using deep learning, with an F1 macro score of 0.536 on the validation dataset and 0.486 on the test dataset; the system ranked 28/41 on the Algerian dialect scoreboard, scoring more than 7% higher than the task baseline using RemBERT.

pdf bib abs
BlueToad at SemEval-2025 Task 3: Using Question-Answering-Based Language Models to Extract Hallucinations from Machine-Generated Text
Michiel Pronk | Ekaterina Kamyshanova | Thijmen Adam | Maxim Van Der Maesen De Sombreff

Hallucination in machine-generated text poses big risks in various domains, such as finance, medicine, and engineering. Task 3 of SemEval-2025, Mu-SHROOM, challenges participants to detect hallucinated spans in such text. Our approach uses pre-trained language models and fine-tuning strategies to enhance hallucination span detection, focusing on the English track. Firstly, we applied GPT-4o mini to generate synthetic data by labeling unlabeled data. Then, we employed encoder-only pre-trained language models with a question-answering architecture for hallucination span detection, ultimately choosing XLM-RoBERTa for fine-tuning on multilingual data. This model appeared to be our best and ranked 18th and 22nd on the English track with 0.469 intersection-over-union and 0.441 correlation scores, respectively. It achieved promising results across multiple languages, surpassing baseline methods in 11 out of 13 languages, with Hindi having the highest scores of 0.645 intersection-over-union and 0.684 correlation coefficient. Our findings highlight the potential of a QA approach and using synthetic and multilingual data for hallucination span detection.
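
As an illustration of the intersection-over-union score mentioned in the abstract above, the following minimal sketch compares predicted and gold hallucination spans at the character level. The span format (end-exclusive `(start, end)` pairs), the function name `span_iou`, and the convention of returning 1.0 when both sides are empty are assumptions made for this example, not details taken from the paper or the task scorer.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # hypothetical format: (start, end) with end exclusive


def span_iou(pred_spans: Iterable[Span], gold_spans: Iterable[Span]) -> float:
    """Character-level intersection-over-union between two sets of spans."""
    # Expand each span into the set of character indices it covers.
    pred = {i for start, end in pred_spans for i in range(start, end)}
    gold = {i for start, end in gold_spans for i in range(start, end)}
    # Assumed convention: two empty annotations agree perfectly.
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)


if __name__ == "__main__":
    print(span_iou([(0, 5)], [(3, 8)]))
```

Working on index sets rather than span boundaries keeps overlapping or adjacent spans from being double-counted.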

pdf bib abs
IASBS at SemEval-2025 Task 11: Ensembling Transformers for Bridging the Gap in Text-Based Emotion Detection
Mehrzad Tareh | Erfan Mohammadzadeh | Aydin Mohandesi | Ebrahim Ansari

In this paper, we address the challenges of text-based emotion detection, focusing on multi-label classification, emotion intensity prediction, and cross-lingual emotion detection across various languages. We explore the use of advanced machine learning models, particularly transformers, in three tracks: emotion detection, emotion intensity prediction, and cross-lingual emotion detection. Our approach utilizes pre-trained transformer models, such as Gemini, DeBERTa, M-BERT, and M-DistilBERT, combined with techniques like majority voting and average ensemble voting (AEV) to enhance performance. We also incorporate multilingual strategies and prompt engineering to effectively handle the complexities of emotion detection across diverse linguistic and cultural contexts. Our findings demonstrate the success of ensemble methods and multilingual models in improving the accuracy and generalization of emotion detection, particularly for low-resource languages.

pdf bib abs
SBU-NLP at SemEval-2025 Task 8: Self-Correction and Collaboration in LLMs for Tabular Question Answering
Rashin Rahnamoun | Mehrnoush Shamsfard

This paper explains the submission of the SBU-NLP team at SemEval-2025 Task 8: question-answering over tabular data. We present a novel algorithm for this task, aimed at systems capable of interpreting large tables and providing accurate answers to natural language queries. The evaluation uses the DataBench dataset, which covers a wide range of topics and reflects the complexity of real-world tabular data. Our approach incorporates a self-correction mechanism that iteratively refines LLM-generated code to address errors and prevent common mistakes. Additionally, a multi-LLM collaborative strategy is employed to generate answers, where responses from multiple LLMs are compared, and the majority consensus or a valid alternative is selected. The method relies exclusively on open-source models, avoiding costly processes like training or fine-tuning. Experimental results demonstrate that combining multiple LLMs with self-correction leads to significant performance improvements. However, challenges arise with list-based answers and responses involving multiple numerical, string, or boolean values, where further refinement is needed. The proposed simple system was among the top performers in both Subtask A and Subtask B among open-source models in the competition.
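
The answer-selection step described in the abstract above (majority consensus, or a valid alternative) could be sketched roughly as follows. This is an illustrative assumption, not the authors' algorithm: the function name `select_answer`, the treatment of `None` or empty strings as invalid answers, and the fallback to the first valid answer are all hypothetical choices.

```python
from collections import Counter
from typing import Optional, Sequence


def select_answer(candidates: Sequence[Optional[str]]) -> Optional[str]:
    """Return the majority answer, or fall back to the first valid one."""
    # Assumption: None or blank strings mark failed generations
    # (for example, code that did not execute).
    valid = [c.strip() for c in candidates if c is not None and c.strip()]
    if not valid:
        return None
    answer, votes = Counter(valid).most_common(1)[0]
    # A strict majority over all candidates wins outright.
    if votes > len(candidates) / 2:
        return answer
    # Assumed fallback: with no majority, keep the first valid answer.
    return valid[0]


if __name__ == "__main__":
    print(select_answer(["42", "42", "7"]))
```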

pdf bib abs
Duluth at SemEval-2025 Task 7: TF-IDF with Optimized Vector Dimensions for Multilingual Fact-Checked Claim Retrieval
Shujauddin Syed | Ted Pedersen

This paper presents our approach to the SemEval-2025 Task 7 on Multilingual and Crosslingual Fact-Checked Claim Retrieval. We implemented a TF-IDF-based retrieval system with experimentation on vector dimensions and tokenization strategies. Our best-performing configuration used word-level tokenization with a vocabulary size of 15,000 features, achieving an average success@10 score of 0.78 on the development set and 0.69 on the test set across ten languages. Our system showed stronger performance on higher resource languages with large performance gaps compared to the top-ranked system, which achieved 0.96 average success@10. Our findings suggest that though advanced neural architectures are increasingly dominant in multilingual retrieval tasks, properly optimized traditional methods like TF-IDF remain competitive baselines, especially in limited resource scenarios.
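
For readers unfamiliar with the success@10 metric reported above, here is a minimal sketch of how such a score is commonly computed: a query counts as a success if at least one relevant item appears among its top-k results. The data layout (dictionaries keyed by post ID) and the name `success_at_k` are assumptions for illustration and are not taken from the paper or the task scorer.

```python
from typing import Dict, List, Set


def success_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one relevant item in the top k."""
    if not ranked:
        return 0.0
    hits = 0
    for query_id, results in ranked.items():
        gold = relevant.get(query_id, set())
        # Only the first k retrieved items are considered.
        if any(doc_id in gold for doc_id in results[:k]):
            hits += 1
    return hits / len(ranked)


if __name__ == "__main__":
    ranked = {"p1": ["f3", "f1"], "p2": ["f9"]}
    relevant = {"p1": {"f1"}, "p2": {"f2"}}
    print(success_at_k(ranked, relevant, k=10))
```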

pdf bib abs
BitsAndBites at SemEval-2025 Task 9: Improving Food Hazard Detection with Sequential Multitask Learning and Large Language Models
Aurora Gensale | Irene Benedetto | Luca Gioacchini | Luca Cagliero | Alessio Bosca

Automatic and early detection of foodborne hazards is crucial for preventing outbreaks. Existing AI-based solutions often struggle with the complexity and noise of food recall reports and with the dependency between product and hazard labels. We introduce a methodology to classify reports on food-related incidents to address these challenges. Our approach leverages LLM-based information extraction to minimize report variability, alongside a two-stage classification pipeline. The first model assigns coarse-grained labels, narrowing the space of eligible fine-grained labels for the second model. This sequential process allows us to capture hierarchical label dependencies between products and hazards and their respective categories. Additionally, we design each model with two classification heads relying on the inherent relations between food products and associated hazards. We validate our approach on two multi-label classification sub-tasks. Experimental results demonstrate the effectiveness of our approach, achieving an improvement of +30% and +40% in classification performance compared to the baseline.

pdf bib abs
G-MACT at SemEval-2025 Task 8: Exploring Planning and Tool Use in Question Answering over Tabular Data
Wei Zhou | Mohsen Mesgar | Annemarie Friedrich | Heike Adel

This paper describes our system submitted to SemEval-2025 Task 8 “Question Answering over Tabular Data.” The shared task focuses on tackling real-life table question answering (TQA) involving extremely large tables with the additional challenges of interpreting complex questions. To address these issues, we leverage a framework of Multi-Agent Collaboration with Tool use (MACT), a method that combines planning and tool use. The planning module breaks down a complex question by designing a step-by-step plan. This plan is translated into Python code by a coding model, and a Python interpreter executes the code to generate an answer. Our system demonstrates competitive performance in the shared task and is ranked 5th out of 38 in the open-source model category. We provide a detailed analysis of our model, evaluating the effectiveness and the efficiency of each component, and identify common error patterns. Our paper offers essential insights and recommendations for future advancements in developing TQA systems.

pdf bib abs
UMUTeam at SemEval-2025 Task 1: Leveraging Multimodal and Large Language Model for Identifying and Ranking Idiomatic Expressions
Ronghao Pan | Tomás Bernal-Beltrán | José Antonio García-Díaz | Rafael Valencia-García

Idioms are non-compositional linguistic expressions whose meanings cannot be directly inferred from the individual words that compose them, posing significant challenges for natural language processing systems. This paper describes the participation of the UMUTeam in Subtask A of the AdMIRe shared task (SemEval 2025), which focuses on understanding idiomatic expressions through visual and contextual representations in English and Portuguese. Specifically, the task involves ranking a set of images according to how well they represent the sense of a potentially idiomatic nominal compound within a given contextual sentence. To address this challenge, we adopted a multimodal approach that combines textual and visual features using pre-trained language models, such as BERT and XLM-RoBERTa, along with Vision Transformers. Additionally, we explored the in-context learning capabilities of Large Language Models (LLMs), particularly Llama-3.1-8B, for image classification. These models are trained using a regression approach to rank images according to their semantic alignment with the contextual meaning of idioms. The results show that the Llama-3.1-8B model performs best for English, ranking 32 out of 36, while the XLM + ViT model is more effective for Portuguese, ranking 21 out of 24.

pdf bib abs
UMUTeam at SemEval-2025 Task 3: Detecting Hallucinations in Multilingual Texts Using Encoder-only Models Guided by Large Language Models
Ronghao Pan | Tomás Bernal-Beltrán | José Antonio García-Díaz | Rafael Valencia-García

Large Language Models like GPT-4, LLaMa, Mistral, and Gemma have revolutionized Natural Language Processing, advancing language comprehension, generation, and reasoning. However, they also present challenges, particularly the tendency to hallucinate—that is, to produce false or fabricated information. This paper presents our participation in Task 3 Mu-SHROOM of SemEval 2025, which focuses on detecting hallucinations in multilingual contexts. Specifically, the task requires identifying text segments generated by LLMs that correspond to hallucinations and calculating the hallucination probability for each character in the text. To address this challenge, we adopted a token classification approach using the pre-trained XLM-RoBERTa-large model, fine-tuned on the provided training set. Additionally, we integrated context from Llama-3.1-70B to enhance hallucination detection by leveraging its broader and more up-to-date knowledge base. Our approach combines the multilingual capability of XLM-RoBERTa with the contextual understanding of Llama-3.1-70B, producing a detailed hallucination probability for each character in the text. The results demonstrate that our approach consistently outperforms baseline methods across multiple languages, particularly in detecting token-level hallucinations.

pdf bib abs
UMUTeam at SemEval-2025 Task 7: Multilingual Fact-Checked Claim Retrieval with XLM-RoBERTa and Self-Alignment Pretraining Strategy
Ronghao Pan | Tomás Bernal-Beltrán | José Antonio García-Díaz | Rafael Valencia-García

In today’s digital age, the rapid dissemination of information through social networks poses significant challenges in verifying the veracity of shared content. The proliferation of misinformation can have serious consequences, influencing public opinion, policy decisions, and social dynamics. Fact-checking plays a critical role in countering misinformation; however, the manual verification process is time-consuming, especially when dealing with multilingual content. This paper presents our participation in the Multilingual and Crosslingual Fact-Checked Claim Retrieval task (SemEval 2025), which seeks to identify previously fact-checked claims relevant to social media posts. Our proposed system leverages XLM-RoBERTa, a multilingual Transformer model, combined with metric learning and hard negative mining strategies, to optimize the semantic comparison of posts and fact-checks across multiple languages. By fine-tuning a shared embedding space and employing a multiple similarity loss function, our approach enhances retrieval accuracy while maintaining efficiency. Evaluation results demonstrate competitive performance across multiple languages, reaching 25th place and highlighting the potential of multilingual NLP models in automating the fact-checking process and mitigating misinformation spread.

pdf bib abs
Lazarus NLP at SemEval-2025 Task 11: Fine-Tuning Large Language Models for Multi-Label Emotion Classification via Sentence-Label Pairing
Wilson Wongso | David Setiawan | Ananto Joyoadikusumo | Steven Limcorn

Multi-label emotion classification in low-resource languages remains challenging due to limited annotated data and model adaptability. To address this, we fine-tune large language models (LLMs) using a sentence-label pairing approach, optimizing efficiency while improving classification performance. Evaluating on Sundanese, Indonesian, and Javanese, our method outperforms conventional classifier-based fine-tuning and achieves strong zero-shot cross-lingual transfer. Notably, our approach ranks first in the Sundanese subset of SemEval-2025 Task 11 Track A. Our findings demonstrate the effectiveness of LLM fine-tuning for low-resource emotion classification, underscoring the importance of tailoring adaptation strategies to specific language families in multilingual contexts.

pdf bib abs
RSSN at SemEval-2025 Task 11: Optimizing Multi-Label Emotion Detection with Transformer-Based Models and Threshold Tuning
Ravindran V | Rajalakshmi Sivanaiah | Angel Deborah S

Our study explores multi-label emotion classification using fine-tuned BERT models, achieving superior performance over traditional methods such as logistic regression. The intricate nature of overlapping emotional expressions in text necessitates a robust classification framework. Fine-tuning BERT with weighted binary cross-entropy loss enhances predictive accuracy, particularly for underrepresented emotions like anger and joy. Moreover, threshold optimization plays a pivotal role in refining decision boundaries, boosting recall, and increasing the macro F1-score. Comparative analysis against RoBERTa and XGBoost further underscores the effectiveness of contextual embeddings in capturing subtle emotional nuances. Despite these improvements, challenges such as class imbalance and inter-class confusion persist, highlighting the need for future advancements in ensemble learning, contrastive pretraining, and domain-adaptive fine-tuning.
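
The threshold optimization mentioned in the abstract above can be pictured with the following minimal sketch, which picks, for each emotion label, the decision threshold that maximizes F1 on held-out data. The grid of candidate thresholds, the function names, and the toy data are assumptions for illustration. The paper's actual tuning procedure is not described in the abstract.

```python
from typing import List, Optional, Sequence


def f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 computed from true/false positive and negative counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def column(rows: Sequence[Sequence[float]], j: int) -> List[float]:
    """Extract the values for label j from row-major data."""
    return [row[j] for row in rows]


def tune_thresholds(
    y_true: Sequence[Sequence[int]],
    probs: Sequence[Sequence[float]],
    grid: Optional[Sequence[float]] = None,
) -> List[float]:
    """For each label, choose the grid threshold with the best F1."""
    if grid is None:
        grid = [i / 20 for i in range(1, 20)]  # assumed grid: 0.05 to 0.95
    thresholds = []
    for j in range(len(y_true[0])):
        truth, scores = column(y_true, j), column(probs, j)
        best = max(grid, key=lambda t: f1(truth, [int(s >= t) for s in scores]))
        thresholds.append(best)
    return thresholds


def macro_f1(y_true, probs, thresholds: Sequence[float]) -> float:
    """Unweighted mean of per-label F1 at the given thresholds."""
    per_label = [
        f1(column(y_true, j), [int(s >= t) for s in column(probs, j)])
        for j, t in enumerate(thresholds)
    ]
    return sum(per_label) / len(per_label)


if __name__ == "__main__":
    y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
    probs = [[0.4, 0.2], [0.1, 0.9], [0.35, 0.8], [0.2, 0.3]]
    tuned = tune_thresholds(y_true, probs)
    print(macro_f1(y_true, probs, [0.5, 0.5]), macro_f1(y_true, probs, tuned))
```

In practice the thresholds would be tuned on a validation split and then applied unchanged to the test set, to avoid overfitting the decision boundaries.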

pdf bib abs
Modgenix at SemEval-2025 Task 1: Context Aware Vision Language Ranking (CAViLR) for Multimodal Idiomaticity Understanding
Joydeb Mondal | Pramir Sarkar

This paper presents CAViLR, a hybrid multimodal approach for SemEval-2025 Task 1. Our method integrates CLIP as a baseline with a Mixture of Experts (MoE) framework that dynamically selects expert models such as Pixtral-12B and Phi-3.5 based on input context. The approach addresses challenges in both image ranking and image sequence prediction, improving the alignment of visual and textual semantics. Experimental results demonstrate that our hybrid model outperforms individual models. Future work will focus on refining expert selection and enhancing disambiguation strategies for complex idiomatic expressions.

pdf bib abs
ABCD at SemEval-2025 Task 9: BERT-based and Generation-based models combine with advanced weighted majority soft voting strategy
Tai Le | Dang Thin

First submission to SemEval-2025 Task 9 by the ABCD team.

pdf bib abs
Sakura at SemEval-2025 Task 2: Enhancing Named Entity Translation with Fine-Tuning and Preference Optimization
Alberto Poncelas | Ohnmar Htun

Translating named entities can be challenging, as it often requires real-world knowledge rather than just performing a literal translation. The shared task “Entity-Aware Machine Translation” in SemEval-2025 encourages participants to build machine translation models that can effectively handle the translation of complex named entities. In this paper, we propose two methods to improve the accuracy of named entity translation from English to Japanese. One approach involves fine-tuning the model on entries, or lists of entries, of the dictionary. The second technique focuses on preference optimization, guiding the model on which translation it should generate.

pdf bib abs
GOLDX at SemEval-2025 Task 11: RoBERTa for Text-Based Emotion Detection
Bill Clinton

This is an approach based on an ensemble of RoBERTa models for classifying emotions from text.

pdf bib abs
EMO-NLP at SemEval-2025 Task 11: Multi-label Emotion Detection in Multiple Languages Based on XLMCNN
Jing Li | Yucheng Xian | Xutao Yang

This paper describes the system implemented by the EMO-NLP team for Track A of Task 11 in SemEval-2025: Bridging the Gap in Text-Based Emotion Detection. The task focuses on multiple datasets covering 28 languages for multi-label emotion detection. Most of these languages are low-resource languages. To achieve this goal, we propose a multilingual multi-label emotion detection system called XLMCNN, which can perform multi-label emotion detection across multiple languages. To enable emotion detection in various languages, we utilize the pre-trained model XLM-RoBERTa-large to obtain embeddings for the text in different languages. Subsequently, we apply a two-dimensional convolutional operation to the embeddings to extract text features, thereby enhancing the accuracy of multi-label emotion detection. Additionally, we assign weights to different emotion labels to mitigate the impact of uneven label distribution. In this task, we focus on nine languages, among which the Amharic language achieves the best performance with our system, ranking 21st out of 45 teams.

pdf bib abs
Anaselka at SemEval-2025 Task 9: Leveraging SVM and MNB for Detecting Food Hazard
Anwar Annas | Al Hafiz Siagian

Our system for Sub-task 1 of SemEval-2025 Task 9 has been designed to tackle the complexities of identifying and categorizing food safety incidents from textual data. Through a rigorous experimental setup, we have developed a text classification solution that leverages state-of-the-art techniques in data preprocessing, feature engineering, and model optimization.

pdf bib abs
MyMy at SemEval-2025 Task 9: A Robust Knowledge-Augmented Data Approach for Reliable Food Hazard Detection
Ben Phan | Jung-Hsien Chiang

Food hazard detection from web sources, including social media and official food agency websites, is crucial for mitigating economic and public health risks. However, challenges such as class imbalance and the need for transparent, explainable AI remain. To address these issues, we propose a Knowledge-Augmented Data approach using Retrieval-Augmented Generation (RAG) to improve food incident report classification in SemEval-2025 Task 9. Our method leverages domain-specific knowledge to enrich datasets and curate high-quality data, enhancing overall integrity. We hypothesize that knowledge-augmented data improves Macro-F1 scores, the primary evaluation metric. Our approach achieved a top-3 average ranking across both subtasks, demonstrating its effectiveness in advancing NLP applications for food safety and contributing to more reliable food hazard detection systems.

pdf bib abs
Tue-JMS at SemEval-2025 Task 11: KReLax: An Ensemble-Based Approach for Multilingual Emotion Detection and Addressing Data Imbalance
Jingyu Han | Megan Horikawa | Suvi Lehtosalo

Emotion detection research has primarily focused on English, leaving a gap for low-resource languages. To address this, we present KReLaX, a multilingual ensemble model for multi-label emotion detection, combining three BERT-based encoders with a weighted voting layer. Within the shared task, our system performed well in multi-label classification, ranking 3rd in Tatar and achieving strong results in Hindi, Russian, Marathi, and Spanish. In emotion intensity classification, we achieved 6th place in Amharic and Hausa. While our system struggled in the zero-shot track, it achieved 7th place in Indonesian. These results highlight both the potential and the challenges of multilingual emotion detection, emphasizing the need for improved generalization in low-resource settings.

This paper describes the participation of team QUST_NLP in the SemEval-2025 Task 7. We propose a three-stage retrieval framework specifically designed for fact-checked claim retrieval. Initially, we evaluate the performance of several retrieval models and select the one that yields the best results for candidate retrieval. Next, we employ multiple re-ranking models to enhance the candidate results, with each model selecting the Top-10 outcomes. In the final stage, we utilize weighted voting to determine the final retrieval outcomes. Our approach achieved 5th place in the monolingual track and 7th place in the crosslingual track. We release our system code at: https://github.com/warmth27/SemEval2025_Task7.
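
The final weighted-voting stage described in the abstract above might look roughly like the sketch below, in which each re-ranker contributes its top-10 list and candidates are scored by a weighted sum of positional credits. The positional scoring rule, the per-model weights, the alphabetical tie-break, and the name `weighted_vote` are assumptions for illustration only. The abstract does not specify how the votes are weighted.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_vote(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    top_k: int = 10,
) -> List[str]:
    """Fuse several top-k lists into one list by weighted positional voting."""
    scores: Dict[str, float] = defaultdict(float)
    for model_name, ranked in rankings.items():
        weight = weights.get(model_name, 1.0)
        for rank, doc_id in enumerate(ranked[:top_k]):
            # Assumed rule: rank 0 earns top_k points, the last slot earns 1.
            scores[doc_id] += weight * (top_k - rank)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


if __name__ == "__main__":
    rankings = {"a": ["x", "y", "z"], "b": ["y", "x"], "c": ["y", "z"]}
    weights = {"a": 1.0, "b": 1.0, "c": 0.5}
    print(weighted_vote(rankings, weights, top_k=3))
```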

We present the system developed by the Central China Normal University (CCNU) team for the SemEval-2025 shared task 8, which focuses on Question-Answering (QA) for tabular data. Our approach leverages multiple Large Language Models (LLMs), conducting tabular QA as code completion. Additionally, to improve its reliability, we introduce a two-stage correction mechanism, in which we instruct the LLM to correct the code according to judgments of whether the code is executable and whether the answer obtained from executing the code is semantically consistent with the question. The experiment demonstrates that code correction works but answer correction does not. Finally, we discuss other unsuccessful approaches explored during our development process.

pdf bib abs
HalluRAG-RUG at SemEval-2025 Task 3: Using Retrieval-Augmented Generation for Hallucination Detection in Model Outputs
Silvana Abdi | Mahrokh Hassani | Rosalien Kinds | Timo Strijbis | Roman Terpstra

Large Language Models (LLMs) suffer from a critical limitation: hallucination, in which models generate fluent but factually incorrect text. This paper presents our approach to hallucination detection in English model outputs as part of the SemEval-2025 Task 3 (Mu-SHROOM). Our method, HalluRAG-RUG, integrates Retrieval-Augmented Generation (RAG) using Llama-3 and prediction models using token probabilities and semantic similarity. We retrieved relevant factual information using a named entity recognition (NER)-based Wikipedia search and applied abstractive summarization to refine the knowledge base. The hallucination detection pipeline then used this retrieved knowledge to identify inconsistent spans in model-generated text. This result was combined with the results of two systems which identified hallucinations based on token probabilities and low-similarity sentences. Our system placed 33rd out of 41, performing slightly below the ‘mark all’ baseline but surpassing the ‘mark none’ and ‘neural’ baselines with an IoU of 0.3093 and a correlation of 0.0833.

pdf bib abs
SALT at SemEval-2025 Task 2: A SQL-based Approach for LLM-Free Entity-Aware-Translation
Tom Volker | Jan Pfister | Andreas Hotho

Entity-aware machine translation faces significant challenges when translating culturally-adapted named entities that require knowledge beyond the source text. We present SALT (SQL-based Approach for LLM-Free Entity-Aware-Translation), a parameter-efficient system for the SemEval-2025 Task 2. Our approach combines SQL-based entity retrieval with constrained neural translation via logit biasing and explicit entity annotations. Despite its simplicity, it achieves state-of-the-art performance (First Place) among approaches not using gold-standard data, while requiring far less computation than LLM-based methods. Our ablation studies show simple SQL-based retrieval rivals complex neural models, and strategic model refinement outperforms increased model complexity. SALT offers an alternative to resource-intensive LLM-based approaches, achieving comparable results with only a fraction of the parameters.

pdf bib abs
AlexNLP-MO at SemEval-2025 Task 8: A Chain of Thought Framework for Question-Answering over Tabular Data
Omar Mokhtar | Minah Ghanem | Nagwa El-Makky

Table Question Answering (TQA) involves extracting answers from structured data using natural language queries, a challenging task due to diverse table formats and complex reasoning. This work develops a TQA system using the DataBench dataset, leveraging large language models (LLMs) to generate Python code in a zero-shot manner. Our approach is highly generic, relying on a structured Chain-of-Thought framework to improve reasoning and data interpretation. Experimental results demonstrate that our method achieves high accuracy and efficiency, making it a flexible and effective solution for real-world tabular question answering.

pdf bib abs
Team UBD at SemEval-2025 Task 11: Balancing Class and Task Importance for Emotion Detection
Cristian Paduraru

This article presents the systems used by Team UBD in Task 11 of SemEval-2025. We participated in all three sub-tasks, namely Emotion Detection, Emotion Intensity Estimation and Cross-Lingual Emotion Detection. In our solutions we make use of publicly available Language Models (LMs) already fine-tuned for the Emotion Detection task, as well as open-sourced models for Neural Machine Translation (NMT). We robustly adapt the existing LMs to the new data distribution, balance the importance of all emotions and classes and also use a custom sampling scheme. We present fine-grained results in all sub-tasks and analyze multiple possible sources for errors for the Cross-Lingual Emotion Detection sub-task.

This paper presents our findings for SemEval 2025 Task 2, a shared task on entity-aware machine translation (EA-MT). The goal of this task is to develop translation models that can accurately translate English sentences into target languages, with a particular focus on handling named entities, which often pose challenges for MT systems. The task covers 10 target languages with English as the source. In this paper, we describe the different systems we employed, detail our results, and discuss insights gained from our experiments.

pdf bib abs
Trans-Sent at SemEval-2025 Task 11: Text-based Multi-label Emotion Detection using Pre-Trained BERT Transformer Models
Zafar Sarif | Md Akhtar | Dr. Abhishek Das | Dipankar Das

We have introduced Trans-Sent, a Transformer-based model designed for multi-label emotion classification in SemEval-2025 Task 11. The model predicts perceived emotions such as joy, sadness, anger, fear, surprise, and disgust from text across seven languages, including Amharic, German, English, Hindi, Marathi, Russian, and Romanian. To handle data imbalance, the system incorporates preprocessing techniques, SMOTE oversampling, and feature engineering to enhance classification accuracy. The model was trained using the BRIGHTER and EthioEmo datasets, which contain diverse textual sources, such as social media, news, literature, and personal narratives. Traditional machine learning models, including Logistic Regression and Decision Trees, were tested but proved inadequate for multi-label classification due to their limited ability to capture contextual and semantic meaning. Fine-tuned BERT models demonstrated superior performance, with Russian achieving the highest ranking (9th overall), while languages with complex grammar, such as German and Amharic, performed lower. Future enhancements may include advanced data augmentation, cross-lingual learning, and multimodal emotion analysis to improve classification across different languages. Trans-Sent contributes to NLP by advancing multi-label emotion detection, particularly in underrepresented languages.

pdf bib abs
KostasThesis2025 at SemEval-2025 Task 10 Subtask 2: A Continual Learning Approach to Propaganda Analysis in Online News
Konstantinos Eleftheriou | Panos Louridas | John Pavlopoulos

In response to the growing challenge of propagandistic content in online news media, there is an evident and increasing need for automated systems that can identify and classify narrative structures in multiple languages. We present our approach to the SemEval-2025 Task 10 Subtask 2, focusing on the challenge of hierarchical multi-label, multi-class classification in multilingual news articles. We present methods to handle long articles with respect to how they are naturally structured in the dataset, propose a hierarchical classification neural network model with respect to the taxonomy, and a continual learning training approach that leverages cross-lingual knowledge transfer.

pdf bib abs
DUTJBD at SemEval-2025 Task 3: A Range of Approaches for Predicting Hallucination Generation in Models
Shengdi Yin | Zekun Wang | Liang Yang | Hongfei Lin

This paper details the various methods we explored. Thank you.

This paper presents our system developed for the SemEval-2025 Task 9: The Food Hazard Detection Challenge. The shared task’s objective is to evaluate explainable classification systems for classifying hazards and products in two levels of granularity from web-collected food recall incident reports. In this work, we propose text augmentation techniques as a way to improve poor performance in minority classes and compare their effect for each category on various transformer and machine learning models. We apply three word-level data augmentation techniques, namely synonym replacement, random word swapping, and contextual word insertion utilizing BERT. The results show that transformer models tend to have a better overall performance. Meanwhile, a statistically significant improvement (P < 0.05) was observed in the fine-grained categories when using BERT to compare the baseline model with the three augmented models, which achieved a 6% increase in correct predictions for minority hazard classes. This suggests that targeted augmentation of minority classes can improve the performance of transformer models.

pdf bib abs
Zhoumou at SemEval-2025 Task 1: Leveraging Multimodal Data Augmentation and Large Language Models for Enhanced Idiom Understanding
Yingzhou Zhao | Bowen Guan | Liang Yang | Hongfei Lin

This paper elaborates on my task content regarding SemEval-2025 Task 1 Subtask A. Please refer to it.

The DataBench shared task in the SemEval-2025 competition aims to tackle the problem of QA from data in tables. Given the diversity of the structure of tables, there are different approaches to retrieving the answer. Although Retrieval-Augmented Generation (RAG) is a viable solution, extracting relevant information from tables remains challenging. In addition, the table can be prohibitively large for direct integration into the LLM context. In this paper, we address QA over tabular data by first identifying relevant columns that might contain the answers, then having the LLM generate answers given the context of those columns, and finally having the LLM refine its answers. This approach secured us 7th place in the DataBench lite category.

pdf bib abs
NBF at SemEval-2025 Task 5: Light-Burst Attention Enhanced System for Multilingual Subject Recommendation
Baharul Islam | Nasim Ahmad | Ferdous Barbhuiya | Kuntal Dey

This paper presents a system for automated subject tagging in a bilingual academic setting. Our approach leverages a novel burst attention mechanism to enhance the alignment between article and subject embeddings, derived from a large cross-lingual subject corpus. By employing a margin-based loss with negative sampling, our resource-efficient model achieves competitive performance in both quantitative and qualitative evaluations. Experimental results demonstrate average recall rates of 32.24% on the full test set, along with robust performance on specialized subsets, making our system well-suited for large-scale subject recommendation tasks.

pdf bib abs
QMUL at SemEval-2025 Task 11: Explicit Emotion Detection with EmoLex, Feature Engineering, and Threshold-Optimized Multi-Label Classification
Angeline Wang | Aditya Gupta | Iran Roman | Arkaitz Zubiaga

SemEval 2025 Task 11 Track A explores the detection of multiple emotions in text samples. Our best model combined BERT (fine-tuned on an emotion dataset) predictions and engineered features with EmoLex words appended. Together, these were used as input to train a multi-layer perceptron. This achieved a final test set Macro F1 score of 0.56. Compared to only using BERT predictions, our system improves performance by 43.6%.

pdf bib abs
Team INSALyon2 at SemEval-2025 Task 10: A Zero-shot Agentic Approach to Text Classification
Mohamed-Nour Eljadiri | Diana Nurbakova

We present Team INSALyon2’s agentic approach to SemEval-2025 Task 10 Subtask 2, which focuses on the multi-label classification of narratives in news articles across five languages. Our system employs a zero-shot architecture where specialized Large Language Model (LLM) agents handle binary classification tasks for individual narrative/subnarrative labels, with a meta-agent aggregating these decisions into final multi-label predictions. Instead of fine-tuning on the dataset, we leverage AutoGen to orchestrate multiple GPT-based agents, each responsible for detecting specific narrative/subnarrative types in a modular framework. This agent-based approach naturally handles the challenge of multi-label classification by enabling parallel decisions across the two-level taxonomy. Experiments on the English subset demonstrate strong performance with our system achieving F1_macro_coarse = 0.513, F1_sample = 0.406, securing third place in the competition. Our findings show that zero-shot agentic approaches can be competitive in complex classification tasks.

pdf bib abs
Team INSAntive at SemEval-2025 Task 10: Hierarchical Text Classification using BERT
Yutong Wang | Diana Nurbakova | Sylvie Calabretto

In this paper, we propose a BERT-based hierarchical text classification framework to address the challenges of training multi-level classification tasks. As part of the SemEval-2025 Task 10 challenge (Subtask 2), the framework performs fine-grained text classification by training dedicated sub-category classifiers for each top-level category. Experimental results demonstrate the feasibility of the proposed approach in multi-class text classification tasks.

pdf bib abs
MSA at SemEval-2025 Task 3: High Quality Weak Labeling and LLM Ensemble Verification for Multilingual Hallucination Detection
Baraa Hikal | Ahmed Nasreldin | Ali Hamdi

This paper describes our submission for SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. The task involves detecting hallucinated spans in text generated by instruction-tuned Large Language Models (LLMs) across multiple languages. Our approach combines task-specific prompt engineering with an LLM ensemble verification mechanism, where a primary model extracts hallucination spans and three independent LLMs adjudicate their validity through probability-based voting. This framework simulates the human annotation workflow used in the shared task validation and test data. Additionally, a fuzzy matching algorithm is utilized to improve span alignment. Our system ranked 1st in Arabic and Basque, 2nd in German, Swedish, and Finnish, and 3rd in Czech, Farsi, and French.

pdf bib abs
SRCB at SemEval-2025 Task 9: LLM Finetuning Approach based on External Attention Mechanism in The Food Hazard Detection
Yuming Zhang | Hongyu Li | Yongwei Zhang | Shanshan Jiang | Bin Dong

This paper reports on the performance of SRCB’s system in SemEval-2025 Task 9: The Food Hazard Detection Challenge. We develop a pipeline consisting of two parts: (1) a Candidate Recall Module, which uses a BERT model to select the most probable correct labels from a large label set; and (2) an LLM Prediction Module, which generates the final prediction with a Large Language Model (LLM). Additionally, to address the issue of long prompts caused by an excessive number of labels, we propose a model architecture to reduce resource consumption and improve performance. Our submission achieves a macro-F1 score of 80.39 on Sub-Task 1 and a macro-F1 score of 54.73 on Sub-Task 2. Our system is released at https://github.com/Doraxgui/Document_Attention

pdf bib abs
Emotion Train at SemEval-2025 Task 11: Comparing Generative and Discriminative Models in Emotion Recognition
Anastasiia Demidova | Injy Hamed | Teresa Lynn | Thamar Solorio

The emotion recognition task has become increasingly popular as it has a wide range of applications in many fields, such as mental health, product management, and population mood state monitoring. SemEval 2025 Task 11 Track A framed the emotion recognition problem as a multi-label classification task. This paper presents our proposed system submissions in the following languages: English, Algerian and Moroccan Arabic, Brazilian and Mozambican Portuguese, German, Spanish, Nigerian-Pidgin, Russian, and Swedish. Here, we compare the emotion-detecting abilities of generative and discriminative pre-trained language models, exploring multiple approaches, including curriculum learning, in-context learning, and instruction and few-shot fine-tuning. We also propose an extended architecture method with a feature fusion technique enriched with emotion scores and a self-attention mechanism. We find that BERT-based models fine-tuned on data of a corresponding language achieve the best results across multiple languages for multi-label text-based emotion classification, outperforming both baseline and generative models.

The widespread deployment of large language models (LLMs) across diverse domains has underscored the critical need to ensure the credibility and accuracy of their generated content, particularly in the presence of hallucinations. These hallucinations can severely compromise both the practical performance of models and the security of their applications. In response to this issue, SemEval-2025 Task 3 Mu-SHROOM: Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes introduces a more granular task for hallucination detection. This task seeks to identify hallucinations in text, accurately locate hallucinated segments, and assess their credibility. In this paper, we present a three-stage method for fine-grained hallucination detection and localization. First, we transform the text into a triplet representation, facilitating more precise hallucination analysis. Next, we leverage a large language model to generate fact-reference texts that correspond to the triplets. Finally, we employ a fact alignment strategy to identify and localize hallucinated segments by evaluating the semantic consistency between the extracted triplets and the generated reference texts. We evaluate our method on the unlabelled test set across all languages in Task 3, demonstrating strong detection performance and validating its effectiveness in multilingual contexts.

pdf bib abs
NLP_goats at SemEval-2025 Task 11: Multi-Label Emotion Classification Using Fine-Tuned Roberta-Large Tranformer
Vijay Karthick Vaidyanathan | Srihari V K | Mugilkrishna D U | Saritha Madhavan

This paper presents a system for multi-label emotion classification and intensity prediction in text, developed for SemEval-2025 Task 11. The method uses a fine-tuned RoBERTa-Large transformer model. The system takes a multi-label classification approach to identifying multiple emotions and uses regression models to estimate emotion strength. The system ranked 31st and 17th in the corresponding tracks. The findings show impressive performance, and the recognition of ambiguous or low-frequency emotions could be improved further using state-of-the-art contextual embeddings and threshold optimization techniques.

pdf bib abs
Firefly Team at SemEval-2025 Task 8: Question-Answering over Tabular Data using SQL/Python generation with Closed-Source Large Language Models
Nga Ho | Tuyen Ho | Hung Le | Dang Thin

In this paper, we describe the official system of the Firefly team for the two main subtasks of SemEval-2025 Task 8: Question-Answering over Tabular Data. Our solution employs large language models (LLMs) to translate natural language queries into executable code, specifically Python and SQL, which is then used to generate answers categorized into five predefined types. Our empirical evaluation highlights the superiority of Python code generation over SQL for this challenge. The experimental results also show that our system achieved competitive performance in both subtasks. In Subtask I: Databench QA, we rank in the top 9 across datasets of any size. In Subtask II: Databench QA Lite, where datasets are restricted to a maximum of 20 rows, our solution ranked 5th.

The Multilingual shared-task on Hallucinations and Related Observable Overgeneration Mistakes in the SemEval-2025 competition aims to detect hallucination spans in the outputs of instruction-tuned LLMs in a multilingual context. In this paper, we address the detection of span hallucinations by applying an ensemble of approaches. In particular, we synthesized a PsiloQA dataset and fine-tuned LLM to detect hallucination spans. In addition, we combined this approach with a white-box method based on uncertainty quantification techniques. Using our combined pipeline, we achieved 3rd place in detecting span hallucinations in Arabic, Catalan, Finnish, Italian, and ranked within the top ten for the rest of the languages.

This paper presents the method for the unlearning of sensitive information from large language models as applied in the SemEval 2025 Task 4 challenge. The unlearning pipeline consists of two phases. In phase I, the model is instructed to forget specific datasets, and in phase II, the model is stabilized using a retention dataset. Unlearning with these methods secured a final score of 0.420 with the 2nd honorary mention in the 7B parameter challenge and a score of 0.36 in the 13th position for the 1B parameter challenge. The paper presents a background study, a brief literature review, and a gap analysis, as well as the methodology employed in our work titled NeuroReset. The training methodology and evaluation metrics are also presented, and the trade-offs between unlearning efficiency and model performance are discussed. The contributions of the paper are systematic unlearning, a comparative analysis of unlearning methods, and an empirical analysis of model performance post-unlearning.

This paper introduces the participation of the QUST team in subtask 1 of SemEval-2025 Task 10. We evaluate various large language models (LLMs) based on instruction tuning (IT) on subtask 1. Specifically, we first analyze the data statistics, suggesting that the imbalance of label distribution made it difficult for LLMs to be fine-tuned. Subsequently, a voting mechanism is utilized on the predictions of the top-3 models to derive the final submission results. The team participated in all language tracks, achieving 1st place in Hindi (HI), 2nd in Russian (RU), 3rd in Portuguese (PT), 6th in Bulgarian (BG), and 7th in English (EN) on the official test set. We release our system code at: https://github.com/warmth27/SemEval2025_Task10

pdf bib abs
AGHNA at SemEval-2025 Task 11: Predicting Emotion and Its Intensity within a Text with EmoBERTa
Moh. Abyan

This paper presents our system developed for SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. The system handles two sub-tasks: Track A, detecting the emotion(s) in a given text, and Track B, calculating the intensity of the emotion(s) in a given text. The system uses EmoBERTa as the baseline model, with some minor differences in approach between the two tracks. With this design, Track A achieved a Macro-F1 score of 0.7372, while Track B achieved an average Pearson r score of 0.7618.

pdf bib abs
TUM-MiKaNi at SemEval-2025 Task 3: Towards Multilingual and Knowledge-Aware Non-factual Hallucination Identification
Miriam Anschütz | Ekaterina Gikalo | Niklas Herbster | Georg Groh

Hallucinations are one of the major problems of LLMs, hindering their trustworthiness and deployment to wider use cases. However, most of the research on hallucinations focuses on English data, neglecting the multilingual nature of LLMs. This paper describes our submission to SemEval-2025 Task-3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. We propose a two-part pipeline that combines retrieval-based fact verification against Wikipedia with a BERT-based system fine-tuned to identify common hallucination patterns. Our system achieves competitive results across all languages, reaching top-10 results in eight languages, including English. Moreover, it supports multiple languages beyond the fourteen covered by the shared task. This multilingual hallucination identifier can help to improve LLM outputs and their usefulness in the future.

pdf bib abs
NITK-VITAL at SemEval-2025 Task 11: Focal-RoBERTa: Addressing Class Imbalance in Multi-Label Emotion Classification
Ashinee Kesanam | Gummuluri Venkata Ravi Ram | Chaithanya Swaroop Banoth | G Rama Mohana Reddy

This paper presents our approach to SemEval Task 11, which focuses on multi-label emotion detection in English textual data. We experimented with multiple methodologies, including traditional machine learning models, deep learning architectures, and transformer-based models. Our best-performing approach employed RoBERTa with focal loss, which effectively mitigated class imbalances and achieved a macro F1-score of 0.7563, outperforming other techniques. Comparative analyses between different embedding strategies, such as TF-IDF, BERT, and MiniLM, revealed that transformer-based models consistently provided superior performance. The results demonstrate the effectiveness of focal loss in handling highly skewed emotion distributions. Our system contributes to advancing multi-label emotion detection by leveraging robust pre-trained models and loss function optimization.
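
To make concrete how focal loss counters class imbalance, here is a minimal sketch of the standard binary focal loss applied per label. It is not the authors' code: the defaults `gamma=2.0` and `alpha=0.25` and the function names are assumptions, and the paper's exact variant may differ.

```python
import math
from typing import Sequence


def binary_focal_loss(prob: float, target: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one label: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    eps = 1e-12  # guards against log(0)
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


def multilabel_focal_loss(
    probs: Sequence[float],
    targets: Sequence[int],
    gamma: float = 2.0,
    alpha: float = 0.25,
) -> float:
    """Mean focal loss over all emotion labels of one example."""
    losses = [binary_focal_loss(p, t, gamma, alpha) for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)
```

The factor `(1 - p_t) ** gamma` shrinks the loss of confidently correct predictions, so frequent, easy labels contribute less and training focuses on rare or hard ones. With `gamma=0` the expression reduces to alpha-weighted binary cross-entropy.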

pdf bib abs
CharsiuRice at SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection
Hiu Yan Yip | Hing Man Chiu | Hai-Yin Yang

This paper presents our participation in SemEval-2025 Task 11, which focuses on bridging the gap in text-based emotion detection. Our team took part in both Tracks A and B, addressing different aspects of emotion classification. We fine-tuned a RoBERTa base model on the provided dataset in Track A, achieving a Macro-F1 score of 0.7264. For Track B, we built on top of the Track A model by incorporating an additional non-linear layer, in the hope of enhancing the Track A model’s understanding of emotion detection. The Track B model achieved an average Pearson’s r of 0.5658. The results demonstrate the effectiveness of fine-tuning in Track A and the potential improvements from architectural modifications in Track B for emotion intensity detection tasks.

pdf bib abs
Deloitte (Drocks) at SemEval-2025 Task 3: Fine-Grained Multi-lingual Hallucination Detection Using Internal LLM Weights
Alex Chandler | Harika Abburi | Sanmitra Bhattacharya | Edward Bowen | Nirmala Pudota

Large Language Models (LLMs) have greatly advanced the field of Natural Language Generation (NLG). Despite their remarkable capabilities, their tendency to hallucinate, producing inaccurate or misleading information, remains a barrier to wider adoption. Current hallucination detection methods mainly employ coarse-grained binary classification at the sentence or document level, overlooking the need for precise identification of the specific text spans containing hallucinations. In this paper, we propose a methodology that generates supplementary context and processes text using an LLM to extract internal weights (features) from various layers. These extracted features serve as input for a neural network classifier designed to perform token-level binary detection of hallucinations. Subsequently, we map the resulting token-level predictions to character-level predictions, enabling the identification of spans of hallucinated text, which we refer to as hallucination spans. Our model achieved a top-ten ranking in 13 of the 14 languages and secured first place for the French language in the SemEval: Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes (Mu-SHROOM), utilizing the Mu-SHROOM dataset provided by the task organizers.
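
The mapping from token-level predictions to character-level spans described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: tokens are produced by a hypothetical whitespace tokenizer (`whitespace_offsets`) rather than the LLM's own tokenizer, and consecutive flagged tokens are merged into one span.

```python
import re
from typing import List, Sequence, Tuple


def whitespace_offsets(text: str) -> List[Tuple[int, int]]:
    """Character offsets [start, end) of whitespace-delimited tokens (stand-in tokenizer)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def tokens_to_char_spans(
    offsets: Sequence[Tuple[int, int]],
    token_labels: Sequence[int],
) -> List[Tuple[int, int]]:
    """Merge runs of consecutive flagged tokens into character spans [start, end)."""
    spans: List[Tuple[int, int]] = []
    prev_flagged_idx = None
    for i, ((start, end), flagged) in enumerate(zip(offsets, token_labels)):
        if not flagged:
            continue
        if spans and prev_flagged_idx == i - 1:
            # Directly follows the previous flagged token: extend that span.
            spans[-1] = (spans[-1][0], end)
        else:
            spans.append((start, end))
        prev_flagged_idx = i
    return spans


text = "Paris is the capital of Germany."
offsets = whitespace_offsets(text)
spans = tokens_to_char_spans(offsets, [0, 0, 0, 0, 1, 1])
```

With a subword tokenizer, the same merge step applies once each token carries its character offsets into the original text.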

pdf bib abs
ATLANTIS at SemEval-2025 Task 3: Detecting Hallucinated Text Spans in Question Answering
Catherine Kobus | Francois Lancelot | Marion-Cecile Martin | Nawal Ould Amer

This paper presents the contributions of the ATLANTIS team to SemEval-2025 Task 3, focusing on detecting hallucinated text spans in question answering systems. Large Language Models (LLMs) have significantly advanced Natural Language Generation (NLG) but remain susceptible to hallucinations, generating incorrect or misleading content. To address this, we explored methods both with and without external context, utilizing few-shot prompting with an LLM, token-level classification, or an LLM fine-tuned on synthetic data. Notably, our approaches achieved top rankings in Spanish and competitive placements in English and German. This work highlights the importance of integrating relevant context to mitigate hallucinations and demonstrates the potential of fine-tuned models and prompt engineering.

pdf bib abs
Dianchi at SemEval-2025 Task 11: Multilabel Emotion Recognition via Orthogonal Knowledge Distillation
Zhenlan Wang | Jiaxuan Liu

This paper presents KDBERT-MLDistill, a novel framework for multi-label emotion recognition developed for SemEval-2025 Task 11. Addressing challenges of fine-grained emotion misdetection and small-data overfitting, the method synergizes BERT-based text encoding with orthogonal knowledge distillation. Key innovations include: (1) Orthogonal regularization on classifier weights to minimize redundant feature correlations, coupled with dynamic pseudo-labeling for periodic data augmentation; (2) A hierarchical distillation mechanism where dual teacher-student models iteratively exchange parameters to balance knowledge retention and exploration.

pdf bib abs
zhouyijiang1 at SemEval-2025 Task 11: A Multi-tag Detection Method based on Pre-training Language Models
Zhou Jiang | Dengtao Zhang

In order to effectively predict the speaker’s informing emotion from text fragments, we propose a transfer learning framework based on the BERT pre-training model through deep semantic feature extraction and cascade structure of dynamic weight linear classifier. In the speaker informing emotion prediction task, a 0.70 F1 score is achieved, illustrating the effectiveness of cross-domain emotion recognition.

pdf bib abs
DNB-AI-Project at SemEval-2025 Task 5: An LLM-Ensemble Approach for Automated Subject Indexing
Lisa Kluge | Maximilian Kähler

This paper presents our system developed for the SemEval-2025 Task 5: LLMs4Subjects: LLM-based Automated Subject Tagging for a National Technical Library’s Open-Access Catalog. Our system relies on prompting a selection of LLMs with varying examples of intellectually annotated records and asking the LLMs to similarly suggest keywords for new records. This few-shot prompting technique is combined with a series of post-processing steps that map the generated keywords to the target vocabulary, aggregate the resulting subject terms to an ensemble vote and, finally, rank them as to their relevance to the record. Our system is fourth in the quantitative ranking in the all-subjects track, but achieves the best result in the qualitative ranking conducted by subject indexing experts.

pdf bib abs
NYCU-NLP at SemEval-2025 Task 11: Assembling Small Language Models for Multilabel Emotion Detection and Intensity Prediction
Zhe-Yu Xu | Yu-Hsin Wu | Lung-Hao Lee

This study describes the design of the NYCU-NLP system for the SemEval-2025 Task 11 that focuses on multilingual text-based emotion analysis. We instruction-tuned three small language models: Gemma-2 (27B), Mistral-small-3 (22B), and Phi-4 (14B) and then assembled them as our main system architecture. Our NYCU-NLP system participated in the English Track A for multilabel emotion detection and the English Track B for emotion intensity prediction. Experimental results show that our best-performing submission produced a macro-averaged F1 score of 0.8225, ranking second of 90 participating teams for Track A, and ranked second among 41 teams for Track B with a Pearson correlation coefficient of 0.8373.

This paper describes our system used in the SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. To address the highly subjective nature of emotion detection tasks, we propose a model ensemble strategy designed to capture the varying subjective perceptions of different users towards textual content. The base models of this ensemble strategy consist of several large language models, which are then combined using methods such as neural networks, decision trees, linear regression, and weighted voting. In Track A, out of 28 languages, our system achieved first place in 19 languages. In Track B, out of 11 languages, our system ranked first in 10 languages. Furthermore, our system attained the highest average performance across all languages in both Track A and Track B.

pdf bib abs
FENJI at SemEval-2025 Task 3: Retrieval-Augmented Generation and Hallucination Span Detection
Flor Alberts | Ivo Bruinier | Nathalie Palm | Justin Paetzelt | Erik Varecha

Large Language Models (LLMs) have significantly advanced Natural Language Processing; however, ensuring the factual reliability of these models remains a challenge, as they are prone to hallucination, that is, generating text that appears coherent but contains inaccurate or unsupported information. SemEval-2025 Mu-SHROOM focused on character-level hallucination detection in 14 languages. In this task, participants were required to pinpoint hallucinated spans in text generated by multiple instruction-tuned LLMs. Our team created a system that combined a Retrieval-Augmented Generation (RAG) approach with prompting a FLAN-T5 model to identify hallucination spans. Contrary to prior literature, our approach yielded disappointing results, underperforming all the “mark-all” baselines and failing to achieve competitive scores. Notably, removing RAG improved performance. The findings highlight that while RAG holds potential for hallucination detection, its effectiveness is heavily influenced by the retrieval component’s context-awareness. Enhancing the RAG’s ability to capture more comprehensive contextual information could improve performance across languages, making it a more reliable tool for identifying hallucination spans.

pdf bib abs
GUIR at SemEval-2025 Task 4: Adaptive Weight Tuning with Gradual Negative Matching for LLM Unlearning
Hrishikesh Kulkarni | Nazli Goharian | Ophir Frieder

Machine Unlearning for Large Language Models, referred to as LLM Unlearning, is receiving increasing attention as a result of the regurgitation of sensitive and harmful content. In this paper, we present our method architecture, results, and analysis of our submission to Task 4: Unlearning sensitive content from Large Language Models. This task includes three subtasks of LLM Unlearning on 1) Long Synthetic documents, 2) Short Synthetic documents, and 3) Real Training documents. Getting rid of the impact of undesirable and unauthorized responses is the core objective of unlearning. Furthermore, it is expected that unlearning should not have an adverse impact on the usability of the model. In this paper, we provide an approach for LLM unlearning that tries to make the model forget while maintaining usability of the model. We perform adaptive weight tuning with Gradient Ascent, KL minimization and Gradual Negative Matching loss functions. Our submission balances retain and forget abilities of the model while outperforming provided benchmarks.

pdf bib abs
ipezoTU at SemEval-2025 Task 7: Hybrid Ensemble Retrieval for Multilingual Fact-Checking
Iva Pezo | Allan Hanbury | Moritz Staudinger

Fact-check retrieval plays a crucial role in combating misinformation by ensuring that claims are accurately matched with relevant fact-checks. In this work, we present a hybrid retrieval pipeline that integrates lexical and semantic retrieval models, leveraging their complementary strengths. We evaluate different retrieval and reranking strategies, demonstrating that hybrid ensembling consistently outperforms individual models, while reranking provides only marginal improvements.
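
One common way to ensemble a lexical and a semantic ranker is reciprocal rank fusion (RRF). The abstract does not say which fusion method the authors used, so the sketch below is only an assumed illustration; the constant `k=60`, the function name, and the document ids are hypothetical.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["fc2", "fc1", "fc3"]   # e.g., output of a BM25-style ranker
semantic = ["fc1", "fc4", "fc2"]  # e.g., output of a dense-embedding ranker
fused = reciprocal_rank_fusion([lexical, semantic])
```

Because RRF uses only ranks, it needs no score calibration between the lexical and semantic models, which is one reason it is a frequent baseline for hybrid retrieval.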

The SemEval-2025 Task 11 addresses multi-label emotion detection, classifying perceived emotions in text. Our system targets Amharic, a morphologically complex, low-resource language. We fine-tune LaBSE with class-weighted loss for multi-label prediction. Our architecture consists of: (i) text tokenization via LaBSE, (ii) a fully connected layer with sigmoid activation for classification, and (iii) optimization using BCEWithLogitsLoss and AdamW. Ablation studies on class balancing and data augmentation showed that simple upsampling did not improve performance, highlighting the need for more sophisticated techniques. Our system ranked 14th out of 43 teams, achieving 0.4938 accuracy, 0.6931 micro-F1, and 0.6450 macro-F1, surpassing the task baseline (0.6383 macro-F1). Error analysis revealed that anger and disgust were well detected, while fear and surprise were frequently misclassified due to overlapping linguistic cues. Our findings underscore the challenges of multi-label emotion detection in low-resource languages. Future work could explore context-aware embeddings, improved data augmentation, and adaptive loss functions.

pdf bib abs
OPI-DRO-HEL at SemEval-2025 Task 9: Integrating Transformer-Based Classification with LLM-Assisted Few-Shot Learning for Food Hazard Detection
Martyna Śpiewak | Daniel Karaś

In this paper, we propose a hybrid approach for food hazard detection that combines a fine-tuned RoBERTa classifier with few-shot learning using an LLM (GPT-3.5-turbo). We address challenges related to unstructured text and class imbalance by applying class weighting and keyword extraction (KeyBERT, YAKE, and Sentence-BERT). When RoBERTa’s confidence falls below a given threshold, a structured prompt comprising the title, extracted keywords, and a few representative examples is used to re-evaluate the prediction with ChatGPT.
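
The confidence-gated routing described above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the threshold value, the function name `classify_with_fallback`, and the `fallback` callable (a stand-in for building the structured prompt and querying the LLM) are all assumptions.

```python
from typing import Callable, Sequence, Tuple


def classify_with_fallback(
    probs: Sequence[float],
    labels: Sequence[str],
    fallback: Callable[[], str],
    threshold: float = 0.5,  # assumed value; the abstract only says "a given threshold"
) -> Tuple[str, str]:
    """Return (label, source): keep the classifier's top label if confident, else defer."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] >= threshold:
        return labels[best], "classifier"
    # Low confidence: hand the example to the (hypothetical) LLM re-evaluation step.
    return fallback(), "llm"
```

For example, `classify_with_fallback([0.2, 0.35, 0.45], ["biological", "chemical", "allergens"], fallback=lambda: "chemical")` defers to the fallback because the top probability is below 0.5. Gating this way limits LLM calls to the examples where the fine-tuned classifier is least reliable.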

pdf bib abs
Zero at SemEval-2025 Task 11: Multilingual Emotion Classification with BERT Variants: A Comparative Study
Revanth Gundam | Abhinav Marri | Radhika Mamidi

Emotion detection in text plays a crucial role in NLP applications such as sentiment analysis and feedback analysis. This study tackles two tasks: multi-label emotion detection, where the goal is to classify text based on six emotions (joy, sadness, fear, anger, surprise, and disgust) in a multilingual setting, and emotion intensity prediction, which assigns an ordinal intensity score to each of these emotions. Using the BRIGHTER dataset, a multilingual corpus spanning 28 languages, the paper addresses issues like class imbalances by treating each emotion as an independent binary classification problem. The paper first explores static embeddings such as GloVe with logistic regression classifiers on top. To capture contextual nuances more effectively, we fine-tune transformer-based models, such as BERT and RoBERTa. Our approach demonstrates the benefits of fine-tuning for improved emotion prediction, while also highlighting the challenges of multilingual and multi-label classification.

pdf bib abs
Zero at SemEval-2025 Task 2: Entity-Aware Machine Translation: Fine-Tuning NLLB for Improved Named Entity Translation
Revanth Gundam | Abhinav Marri | Advaith Malladi | Radhika Mamidi

Machine Translation (MT) is an essential tool for communication amongst people across different cultures, yet Named Entity (NE) translation remains a major challenge because named entities are rare and ambiguous. Traditional approaches, like using lexicons or parallel corpora, often fail to generalize to unseen entities, and hence do not perform well. To address this, we create a silver dataset using the Google Translate API and fine-tune the facebook/nllb200-distilled-600M model with LoRA (Low-Rank Adaptation) to enhance translation accuracy while also maintaining efficient memory use. Evaluated with metrics such as BLEU, COMET, and M-ETA, our results show that fine-tuning a specialized MT model improves NE translation without having to rely on large-scale general-purpose models.

pdf bib abs
VerbaNexAI at SemEval-2025 Task 11 Track A: A RoBERTa-Based Approach for the Classification of Emotions in Text
Danileth Almanza | Juan Martínez Santos | Edwin Puertas

Emotion detection in text has become a highly relevant research area due to the growing interest in understanding emotional states from human interaction in the digital world. This study presents an approach for emotion detection in text using a RoBERTa-based model, optimized for multi-label classification of the emotions joy, sadness, fear, anger, and surprise in the context of the SemEval 2025 - Task 11: Bridging the Gap in Text-Based Emotion Detection competition. Advanced preprocessing strategies were incorporated, including the augmentation of the training dataset through automatic translation to improve the representativeness of less frequent emotions. Additionally, a loss function adjustment mechanism was implemented to mitigate class imbalance, enabling the model to enhance its detection capability for underrepresented categories. The experimental results reflect competitive performance, with a macro F1 of 0.6577 on the development set and 0.6266 on the test set. In the competition, the model ranked 47th, demonstrating solid performance against the challenge posed.

SemEval-2025 Task 1 introduces multimodal datasets for idiomatic expression representation. Subtask A focuses on ranking images based on potentially idiomatic noun compounds in given sentences. Idiom comprehension demands the fusion of visual and auditory elements with contextual semantics, yet existing datasets exhibit phrase-image discordance and culture-specific opacity, impeding cross-modal semantic alignment. To address these challenges, we propose an integrated approach that combines data augmentation and model fine-tuning in subtask A. First, we construct two idiom datasets by generating visual metaphors for idiomatic expressions to fine-tune the CLIP model. Next, we propose a three-stage multimodal chain-of-thought method, fine-tuning Qwen2.5-VL-7B-Instruct to generate rationales and perform inference, alongside zero-shot experiments with Qwen2.5-VL-72B-Instruct. Finally, we integrate the output of different models through a voting mechanism to enhance the accuracy of multimodal semantic matching. This approach achieves 0.92 accuracy on the Portuguese test set and 0.93 on the English test set, ranking 3rd and 4th, respectively. The implementation code is publicly available at https://github.com/wyn1015/semeval.

pdf bib abs
Advacheck at SemEval-2025 Task 3: Combining NER and RAG to Spot Hallucinations in LLM Answers
Anastasia Voznyuk | German Gritsai | Andrey Grabovoy

The Mu-SHROOM competition in the SemEval-2025 Task 3 aims to tackle the problem of detecting spans with hallucinations in texts generated by Large Language Models (LLMs). Our system, submitted to this task, is a joint architecture that utilises Named Entity Recognition (NER), Retrieval-Augmented Generation (RAG) and LLMs to gather, compare and analyse information in the texts provided by the organizers. We extract entities potentially capable of containing hallucinations with NER, aggregate relevant topics for them using RAG, then verify and provide a verdict on the extracted information using the LLMs. With a certain level of quality, this approach found hallucinations not only in facts but also in misspellings of names and titles, which was not always accepted by human annotators in the ground-truth markup. We also point out some inconsistencies within annotators’ spans that perhaps affected the scores of all participants.

pdf bib abs
PALI-NLP at SemEval 2025 Task 1: Multimodal Idiom Recognition and Alignment
Runyang You | Xinyue Mei | Mengyuan Zhou

Understanding idioms in multimodal contexts poses significant challenges due to data scarcity, idiomatic ambiguity, and the need for effective alignment of visual and textual inputs. In this work, we introduce MIRA (Multimodal Idiom Recognition and Alignment), a training-free framework designed to address these challenges on the SemEval-2025 Task 1 (AdMIRe) benchmark. MIRA leverages powerful closed-source large language models (LLMs) and integrates three key innovations: bias correction via in-context learning, multi-step semantic-visual fusion, and a self-revision mechanism that iteratively refines its outputs through backward verification. By systematically processing and fusing multimodal inputs, MIRA generates high-quality, fine-grained image-text representations that enhance idiom comprehension across different languages and cultural contexts. Experimental evaluations in both English and Portuguese demonstrate that our approach achieves robust performance without the need for additional training, setting a new standard for multimodal idiom recognition.

pdf bib abs
UTBNLP at Semeval-2025 Task 11: Predicting Emotion Intensity with BERT and VAD-Informed Attention.
Melissa Moreno | Juan Martínez Santos | Edwin Puertas

Emotion intensity prediction plays a crucial role in affective computing, allowing for a more precise understanding of how emotions are conveyed in text. This study proposes a system that estimates emotion intensity levels by integrating contextual language representations with numerical emotion-based features derived from Valence, Arousal, and Dominance (VAD). The methodology combines BERT embeddings, predefined VAD values per emotion, and machine learning techniques to enhance emotion detection, without relying on external lexicons. The system was evaluated on the SemEval-2025 Task 11 Track B dataset, predicting five emotions (anger, fear, joy, sadness, and surprise) on an ordinal scale. The results highlight the effectiveness of integrating contextual representations with predefined VAD values, enabling a more nuanced representation of emotional intensity. However, challenges arose in distinguishing intermediate intensity levels, affecting classification accuracy for certain emotions. Despite these limitations, the study provides insights into the strengths and weaknesses of combining deep learning with numerical emotion modeling, contributing to the development of more robust emotion prediction systems. Future research will explore advanced architectures and additional linguistic features to enhance model generalization across diverse textual domains.

Question answering using Large Language Models has gained significant popularity both in everyday communication and at the workplace. However, certain tasks, such as querying tables, still pose challenges for commercial and open-source chatbots powered by advanced deep learning models. Addressing these challenges requires specialized approaches. During the SemEval-2025 Task 8 competition focused on tabular data, our solution achieved 86.21% accuracy and took 2nd place out of 100 teams. In this paper we present ten methods that significantly improve the baseline solution. Our code is available as open-source.

pdf bib abs
OPI-DRO-HEL at SemEval-2025 Task 11: Few-shot prompting for Text-based Emotion Recognition
Daniel Karaś | Martyna Śpiewak

This paper presents our system, developed as our contribution to SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection, in particular Track A, the Multi-label Emotion Detection subtask. Our approach relies on two distinct components: a semantic search for the top N most similar inputs from the training set, and an interface to a pretrained LLM that is prompted with the retrieved examples. We examine several prompting strategies and their impact on the overall performance of the proposed solution.
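
The two components described above (retrieving the top N most similar training inputs, then prompting an LLM with them) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy three-dimensional vectors stand in for real sentence embeddings, the prompt wording is invented, and the function names `cosine`, `top_n_examples`, and `build_prompt` are hypothetical rather than taken from the authors' system.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_n_examples(query_vec, train, n=2):
    """Return the n training examples most similar to the query.

    `train` is a list of (text, labels, vector) tuples; the vectors are
    assumed to come from some sentence-embedding model (not shown).
    """
    ranked = sorted(train, key=lambda ex: cosine(query_vec, ex[2]), reverse=True)
    return ranked[:n]

def build_prompt(query_text, examples):
    """Assemble a few-shot prompt from the retrieved examples (hypothetical wording)."""
    parts = ["Label the emotions expressed in each text."]
    for text, labels, _ in examples:
        parts.append(f"Text: {text}\nEmotions: {', '.join(labels)}")
    parts.append(f"Text: {query_text}\nEmotions:")
    return "\n\n".join(parts)

# Toy training set with made-up embeddings.
TRAIN = [
    ("I can't stop smiling today", ["joy"], [0.9, 0.1, 0.0]),
    ("This is terrifying and infuriating", ["fear", "anger"], [0.0, 0.8, 0.6]),
    ("What a lovely surprise", ["joy", "surprise"], [0.7, 0.0, 0.1]),
]

if __name__ == "__main__":
    shots = top_n_examples([1.0, 0.0, 0.0], TRAIN, n=2)
    print(build_prompt("Best day ever!", shots))
```

The resulting prompt string would then be sent to whichever pretrained LLM is in use; that call is left out because the abstract does not specify the model or its interface.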

pdf bib abs
Saama Technologies at SemEval-2025 Task 8: Few-shot prompting with LLM-generated examples for question answering on tabular data
Kamal Raj Kanakarajan | Hwanmun Kim | Malaikannan Sankarasubbu

For SemEval 2025 Task 8, addressing tabular data question answering, we introduce a novel few-shot prompting system that guides large language models (LLMs) to generate Python code representing the reasoning process. Our system automatically creates a library of exemplar code snippets from training data, which are then used for few-shot prompting. Crucially, we incorporate a selection prompt to choose the best candidate code from multiple LLM-generated options, improving robustness and accuracy. Our system achieved competitive results, ranking 17th in the Open Model track and 25th overall. Ablation studies demonstrate the effectiveness of our exemplar generation and code selection strategies. We conclude with a discussion of limitations and promising avenues for future research.

pdf bib abs
Tuebingen at SemEval-2025 Task 10: Class Weighting, External Knowledge and Data Augmentation in BERT Models
Özlem Karabulut | Soudabeh Eslami | Ali Gharaee | Matthew Andrews

The spread of disinformation and propaganda in online news presents a significant challenge to information integrity. As part of the SemEval 2025 Task-10 on Multilingual Characterization and Extraction of Narratives from Online News, this study focuses on Subtask 1: Entity Framing, which involves assigning roles to named entities within news articles across multiple languages. We investigate techniques such as data augmentation, external knowledge, and class weighting to improve classification performance. Our findings indicate that class weighting was more effective than other approaches.
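
As an illustration of class weighting in general, the sketch below computes inverse-frequency weights and applies them to a per-example negative log-likelihood. This is one common weighting scheme, assumed here for illustration; the abstract does not state which formula the authors used. The role names and counts are toy values, and `class_weights` and `weighted_nll` are hypothetical helper names.

```python
import math
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c).

    N is the number of examples, K the number of classes, and n_c the
    count of class c, so rarer classes receive larger weights.
    """
    counts = Counter(labels)
    total, num_classes = len(labels), len(counts)
    return {c: total / (num_classes * n) for c, n in counts.items()}

def weighted_nll(probs, gold, weights):
    """Negative log-likelihood of the gold class, scaled by its class weight."""
    return -weights[gold] * math.log(probs[gold])

# Toy, imbalanced label distribution (illustrative counts only).
LABELS = ["Antagonist"] * 6 + ["Protagonist"] * 3 + ["Innocent"] * 1

if __name__ == "__main__":
    weights = class_weights(LABELS)
    for role, w in sorted(weights.items()):
        print(f"{role}: {w:.3f}")
```

With this scheme, a mistake on the rare class contributes more to the loss than the same mistake on a frequent class, which is the effect class weighting is meant to achieve.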

pdf bib abs
VerbaNexAI at SemEval-2025 Task 2: Enhancing Entity-Aware Translation with Wikidata-Enriched MarianMT
Daniel Peña Gnecco | Juan Carlos Martinez Santos | Edwin Puertas

This paper presents the VerbaNexAI Lab system for SemEval-2025 Task 2: Entity-Aware Machine Translation (EA-MT), focusing on translating named entities from English to Spanish across categories such as musical works, foods, and landmarks. Our approach integrates detailed data preprocessing, enrichment with 240,432 Wikidata entity pairs, and fine-tuning of the MarianMT model to enhance entity translation accuracy. Official results reveal a COMET score of 87.09, indicating high fluency, an M-ETA score of 24.62, highlighting challenges in entity precision, and an Overall Score of 38.38, ranking last among 34 systems. While Wikidata improved translations for common entities like “Águila de San Juan,” our static methodology underperformed compared to dynamic LLM-based approaches.

pdf bib abs
CSECU-Learners at SemEval-2025 Task 9: Enhancing Transformer Model for Explainable Food Hazard Detection in Text
Monir Ahmad | Md. Akram Hossain | Abu Nowshed Chy

Food contamination and associated illnesses represent significant global health challenges, leading to thousands of deaths worldwide. As the volume of food-related incident reports on web platforms continues to grow, there is a pressing demand for systems capable of detecting food hazards effectively. Furthermore, explainability in food risk detection is crucial for building trust in automated systems, allowing humans to validate predictions. SemEval-2025 Task 9 proposes a food hazard detection challenge to address this issue, utilizing content extracted from websites. This task is divided into two sub-tasks. Sub-task 1 involves classifying the type of hazard and product, while sub-task 2 focuses on identifying precise hazard and product “vectors” to offer detailed explanations for the predictions. This paper presents our participation in this task, where we introduce a transformer-based method. We fine-tune an enhanced version of the BERT transformer to process lengthy food incident reports. Additionally, we combine the transformer’s contextual embeddings to enhance its contextual representation for hazard and product “vectors” prediction. The experimental results reveal the competitive performance of our proposed method in this task. We have released our code at https://github.com/AhmadMonirCSECU/SemEval-2025_Task9.

pdf bib abs
NLP-DU at SemEval-2025 Task 11: Analyzing Multi-label Emotion Detection
Sadman Sakib | Ahaj Faiak | Abdullah Ibne Hanif Arean | Fariha Anjum Shifa

This paper describes NLP-DU’s entry to SemEval-2025 Task 11 on multi-label emotion detection. We investigated the efficacy of transformer-based models and propose an ensemble approach that combines multiple models. Our experiments demonstrate that the ensemble outperforms individual models under the dataset constraints, yielding superior performance on key evaluation metrics. These findings underscore the potential of ensemble techniques in enhancing multi-label emotion detection and contribute to the broader understanding of emotion analysis in natural language processing.

pdf bib abs
WordWiz at SemEval-2025 Task 10: Optimizing Narrative Extraction in Multilingual News via Fine-Tuned Language Models
Ruhollah Ahmadi | Hossein Zeinali

This paper presents our WordWiz system for SemEval-2025 Task 10: Narrative Extraction. We employed a combination of targeted preprocessing techniques and instruction-tuned language models to generate concise, accurate narrative explanations across five languages. Our approach leverages an evidence refinement strategy that removes irrelevant sentences, improving signal-to-noise ratio in training examples. We fine-tuned Microsoft’s Phi-3.5 model using both Supervised Fine-Tuning (SFT). During inference, we implemented a multi-temperature sampling strategy that generates multiple candidate explanations and selects the optimal response using narrative relevance scoring. Notably, our smaller Phi-3.5 model consistently outperformed larger alternatives like Llama-3.1-8B across most languages. Our system achieved significant improvements over the baseline across all languages, with F1 scores ranging from 0.7486 (Portuguese) to 0.6839 (Bulgarian), demonstrating the effectiveness of evidence-guided instruction tuning for narrative extraction.

pdf bib abs
LyS at SemEval 2025 Task 8: Zero-Shot Code Generation for Tabular QA
Adrián López Gude | Roi Santos Ríos | Francisco Prado Valiño | Ana Ezquerro | Jesús Vilares

We developed a zero-shot pipeline that leverages a Large Language Model to generate functional code capable of extracting the relevant information from tabular data based on an input question. Our approach consists of a modular pipeline where the main code generator module is supported by additional components that identify the most relevant columns and analyze their data types to improve extraction accuracy. In the event that the generated code fails, an iterative refinement process is triggered, incorporating the error feedback into a new generation prompt to enhance robustness. Our results show that zero-shot code generation is a valid approach for Tabular QA, achieving rank 33 of 53 in the test phase despite the lack of task-specific fine-tuning.
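
The iterative refinement idea (re-prompting with the error message when generated code fails) can be sketched as a small retry loop. This is a simplified illustration under stated assumptions, not the authors' pipeline: `run_with_refinement` and `fake_generator` are hypothetical names, the generator is a stub standing in for an LLM call, the table is a plain list of dictionaries, and the convention that generated code assigns its result to `answer` is invented for the example.

```python
def run_with_refinement(question, table, generate_code, max_attempts=3):
    """Run generated code, feeding any error back into the next generation.

    `generate_code(question, feedback)` is assumed to return Python source
    that assigns its result to a variable named `answer`.
    Returns (answer, attempts_used), or (None, max_attempts) on failure.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate_code(question, feedback)
        scope = {"table": table}
        try:
            # WARNING: exec on untrusted model output is unsafe outside a sandbox.
            exec(code, scope)
            return scope["answer"], attempt
        except Exception as err:
            # The error text becomes part of the next prompt.
            feedback = f"{type(err).__name__}: {err}"
    return None, max_attempts

def fake_generator(question, feedback):
    """Stub for an LLM: the first draft uses a wrong column name, the retry fixes it."""
    if feedback is None:
        return "answer = max(row['Age'] for row in table)"
    return "answer = max(row['age'] for row in table)"

TABLE = [{"name": "Ana", "age": 34}, {"name": "Roi", "age": 41}]

if __name__ == "__main__":
    print(run_with_refinement("What is the highest age?", TABLE, fake_generator))
```

In a real system the stub would be replaced by a model call whose prompt includes the question, the relevant column metadata, and the previous error message.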

pdf bib abs
AILS-NTUA at SemEval-2025 Task 3: Leveraging Large Language Models and Translation Strategies for Multilingual Hallucination Detection
Dimitra Karkani | Maria Lymperaiou | George Filandrianos | Nikolaos Spanos | Athanasios Voulodimos | Giorgos Stamou

Multilingual hallucination detection stands as an underexplored challenge, which the Mu-SHROOM shared task seeks to address. In this work, we propose an efficient, training-free LLM prompting strategy that enhances detection by translating multilingual text spans into English. Our approach achieves competitive rankings across multiple languages, securing two first positions in low-resource languages. The consistency of our results highlights the effectiveness of our translation strategy for hallucination detection, demonstrating its applicability regardless of the source language.

pdf bib abs
Dataground at SemEval-2025 Task 8: Small LLMs and Preference Optimization for Tabular QA
Giuseppe Attardi | Andrea Nelson Mauro | Daniele Sartiano

We present our submission to SemEval 2025 Task 8: Question Answering on Tabular Data, which challenges participants to develop systems capable of answering natural language questions on real-world tabular datasets. Our approach aims at generating Pandas code that can be run on such datasets to produce the desired answer. The approach consists in fine-tuning a Small Language Model (SLM) through Preference Optimization on both positive and negative examples generated by a teacher model. A base SLM is first elicited to produce the code to compute the answer to a question through a Chain of Thought (CoT) prompt. We performed extensive testing on the DataBench development set, exploring a variety of prompts, eventually settling on a detailed instruction prompt, followed by two-shot examples. Due to hardware constraints, the base model was an SLM with at most 8 billion parameters. We then fine-tuned the model through Odds Ratio Preference Optimization (ORPO) using as training data the code produced by a teacher model on the DataBench training set. The teacher model was GPT-4o, whose code was labeled preferred, while the code generated by the base model was rejected. This increased the accuracy on the development set from 71% to 85%. Our method demonstrated robust performance in answering complex questions across diverse datasets, highlighting the effectiveness of combining small LLMs with supervised fine-tuning and automated code execution for tabular question answering.

pdf bib abs
Core Intelligence at SemEval-2025 Task 8: Multi-hop LLM Agent for Tabular Question Answering
Maryna Chernyshevich

This paper describes a multi-hop LLM agent for tabular question answering developed for SemEval-2025 Task 8, which ranked 6th with 87% accuracy. Our approach combines a proprietary LLM (ChatGPT-3.5-turbo) for code generation and an open-source LLM (Llama-3.2-3B) for answer validation.

pdf bib abs
MALTO at SemEval-2025 Task 3: Detecting Hallucinations in LLMs via Uncertainty Quantification and Larger Model Validation
Claudio Savelli | Alkis Koudounas | Flavio Giobergia

Large language models (LLMs) often produce hallucinations, factually incorrect statements that appear highly persuasive. These errors pose risks in fields like healthcare, law, and journalism. This paper presents our approach to the Mu-SHROOM shared task at SemEval 2025, which challenges researchers to detect hallucination spans in LLM outputs. We introduce a new method that combines probability-based analysis with Natural Language Inference to evaluate hallucinations at the word level. Our technique aims to better align with human judgments while working independently of the underlying model. Our experimental results demonstrate the effectiveness of this method compared to existing baselines.

pdf bib abs
LCTeam at SemEval-2025 Task 3: Multilingual Detection of Hallucinations and Overgeneration Mistakes Using XLM-RoBERTa
Araya Hailemariam | Jose Maldonado Rodriguez | Ezgi Başar | Roman Kovalev | Hanna Shcharbakova

In recent years, the tendency of large language models to produce hallucinations has become an object of academic interest. Hallucinated or overgenerated outputs created by LLMs contain factual inaccuracies which can potentially invalidate textual coherence. The Mu-SHROOM shared task sets the goal of developing strategies for detecting hallucinated parts of LLM outputs in a multilingual context. We present an approach applicable across multiple languages, which incorporates the alignment of tokens and hard labels, as well as training a multi-lingual XLM-RoBERTa model. With this approach we managed to achieve 2nd in Chinese and top-10 positions in 7 other language tracks of the competition.

pdf bib abs
CSECU-Learners at SemEval-2025 Task 11: Multilingual Emotion Recognition and Intensity Prediction with Language-tuned Transformers and Multi-sample Dropout
Monir Ahmad | Muhammad Anwarul Azim | Abu Nowshed Chy

In today’s digital era, individuals convey their feelings, viewpoints, and perspectives across various platforms in nuanced and intricate ways. At times, these expressions can be challenging to articulate and interpret. Emotion recognition aims to identify the most relevant emotions in a text that accurately represent the author’s psychological state. Despite its substantial impact on natural language processing (NLP), this task has primarily been researched only in high-resource languages. To bridge this gap, SemEval-2025 Task 11 introduces a multilingual emotion recognition challenge encompassing 32 languages, promoting broader linguistic inclusivity in emotion recognition. This paper presents our participation in this task, where we introduce a language-specific fine-tuned transformer-based system for emotion recognition and emotion intensity prediction. To enhance generalization, we incorporate a multi-sample dropout strategy. Our approach is evaluated across 11 languages, and experimental results demonstrate its competitive performance, achieving top-tier results in certain languages.

pdf bib abs
QleverAnswering-PUCRS at SemEval-2025 Task 8: Exploring LLM agents, code generation and correction for Table Question Answering
André Bergmann Lisboa | Lucas Cardoso Azevedo | Lucas Rafael Costella Pessutto

Table Question Answering (TQA) is a challenging task that requires reasoning over structured data to extract accurate answers. This paper presents QleverAnswering-PUCRS, our submission to SemEval-2025 Task 8: DataBench, Question-Answering over Tabular Data. QleverAnswering-PUCRS is a modular multi-agent system that employs a structured approach to TQA. The approach revolves around breaking down the task into specialized agents, each dedicated to handling a specific aspect of the problem. Our system was evaluated on benchmark datasets and achieved competitive results, ranking mid-to-top positions in the SemEval-2025 competition. Despite these promising results, we identify areas for improvement, particularly in handling complex queries and nested data structures.

pdf bib abs
Habib University at SemEval-2025 Task 9: Using Ensemble Models for Food Hazard Detection
Rabia Shahab | Iqra Azfar | Hammad Sajid | Ayesha Enayat

Food safety incidents pose serious threats to public health, requiring efficient detection systems. This study contributes to SemEval 2025 Task 9: Food Hazard Detection by leveraging insights from existing literature and using multiple BERT-based models for multi-label classification of food hazards and product categories. Using a dataset of food recall notifications, we applied preprocessing techniques to prepare data and address challenges like class imbalance. Experimental results show strong hazard classification performance on ensembled models such as DistilBERT, SciBERT, and DeBERTa but highlight product classification variability. Building on Nancy et al. and Madry et al.’s work, we explored strategies like ensemble modeling and data augmentation to improve accuracy and explainability, paving the way for scalable food safety solutions.

pdf bib abs
iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss
Yujian Sun | Tian Li

As the Large Language Model (LLM) gains widespread adoption, increasing attention has been given to the challenge of making LLM forget non-compliant data memorized during its pre-training. Machine Unlearning focuses on efficiently erasing sensitive information from LLM under limited computational resources. To advance research in this area, SemEval 2025 Task 4: “Unlearning Sensitive Content from Large Language Models” introduces three unlearning datasets and establishes a benchmark by evaluating both forgetting effectiveness and the preservation of standard capabilities. In this work, we propose a more controllable forgetting loss, Effective Unlearning Loss, and explore its integration with various techniques to achieve more efficient and controlled unlearning. Our system ultimately ranked 5th on the competition leaderboard.

pdf bib abs
Word2winners at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval
Amirmohammad Azadi | Sina Zamani | Mohammadmostafa Rostamkhani | Sauleh Eetemadi

This paper describes our system for SemEval 2025 Task 7: Previously Fact-Checked Claim Retrieval. The task requires retrieving relevant fact-checks for a given input claim from the extensive, multilingual MultiClaim dataset, which comprises social media posts and fact-checks in several languages. To address this challenge, we first evaluated zero-shot performance using state-of-the-art English and multilingual retrieval models and then fine-tuned the most promising systems, leveraging machine translation to enhance crosslingual retrieval. Our best model achieved an accuracy of 85% on crosslingual data and 92% on monolingual data.

pdf bib abs
CAISA at SemEval-2025 Task 7: Multilingual and Cross-lingual Fact-Checked Claim Retrieval
Muqaddas Haroon | Shaina Ashraf | Ipek Baris | Lucie Flek

We leveraged LLaMA, utilizing its ability to evaluate the relevance of retrieved claims within a retrieval-based fact-checking framework. This approach aimed to explore the impact of large language models (LLMs) on retrieval tasks and assess their effectiveness in enhancing fact-checking accuracy. Additionally, we integrated Jina embeddings v2 and the MPNet multilingual sentence transformer to filter and rank a set of 500 candidate claims. These refined claims were then used as input for LLaMA, ensuring that only the most contextually relevant ones were assessed.

The Unlearning Sensitive Content from Large Language Models task aims to remove targeted datapoints from trained models while minimally affecting their general knowledge. In our work, we leverage parameter-efficient, gradient-based unlearning using low-rank (LoRA) adaptation and layer-focused fine-tuning. To further enhance unlearning effectiveness, we employ data chunking, splitting forget data into disjoint partitions and merging them with cyclically sampled retain samples at a pre-defined ratio. Our task-agnostic method achieves an outstanding forget-retain balance, ranking first on leaderboards and significantly outperforming baselines and competing systems.
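
The data chunking step described above can be read as follows; the sketch is one plausible interpretation, not the authors' implementation. It assumes the "pre-defined ratio" means a fixed number of retain samples per forget sample and that "cyclically sampled" means drawing from the retain set in round-robin order. The function name `build_chunks` and the toy sample identifiers are hypothetical.

```python
from itertools import cycle

def build_chunks(forget, retain, num_chunks, retain_per_forget):
    """Split `forget` into disjoint partitions and pair each with retain samples.

    Retain samples are drawn in round-robin order from `retain`, wrapping
    around when exhausted, so every chunk keeps the same retain:forget ratio.
    """
    if not retain:
        raise ValueError("retain set must not be empty")
    # Strided slicing yields disjoint partitions that together cover `forget`.
    partitions = [forget[i::num_chunks] for i in range(num_chunks)]
    retain_stream = cycle(retain)
    chunks = []
    for part in partitions:
        sampled = [next(retain_stream) for _ in range(len(part) * retain_per_forget)]
        chunks.append({"forget": part, "retain": sampled})
    return chunks

FORGET = ["f0", "f1", "f2", "f3", "f4", "f5"]
RETAIN = ["r0", "r1", "r2", "r3", "r4"]

if __name__ == "__main__":
    for chunk in build_chunks(FORGET, RETAIN, num_chunks=3, retain_per_forget=2):
        print(chunk)
```

Each chunk could then be used for one unlearning pass, with the forget portion driving the unlearning objective and the retain portion regularizing against loss of general knowledge.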

pdf bib abs
Amado at SemEval-2025 Task 11: Multi-label Emotion Detection in Amharic and English Data
Girma Bade | Olga Kolesnikova | Jose Oropeza | Grigori Sidorov | Mesay Yigezu

Amado at SemEval-2025 Task 11: Multi-label Emotion Detection in Amharic and English Data. Girma Yohannis Bade, Olga Kolesnikova, José Luis Oropeza, Grigori Sidorov, Mesay Gemeda Yigezu (Centro de Investigaciones en Computación (CIC), Instituto Politécnico Nacional (IPN), Miguel Othon de Mendizabal, Ciudad de México, 07320, México).

pdf bib abs
NarrativeNexus at SemEval-2025 Task 10: Entity Framing and Narrative Extraction using BART
Hareem Siraj | Kushal Chandani | Dua E Sameen | Ayesha Enayat

This paper presents NarrativeNexus’ participation in SemEval-2025 Task 10 on fine-grained entity framing and narrative extraction. Our approach utilizes BART, a transformer-based encoder-decoder model, fine-tuned for sequence classification and text generation. For Subtask 1, we employed a BART-based sequence classifier to identify and categorize named entities within news articles, mapping them to predefined roles such as protagonists, antagonists, and innocents. In Subtask 3, we leveraged a text-to-text generative approach to generate justifications for dominant narratives. Our methodology included hyperparameter tuning, data augmentation, and ablation studies to assess model components. NarrativeNexus achieved 18th place in Subtask 1 and 10th in Subtask 3 on the English dataset. Our findings highlight the strengths of pre-trained transformers in structured content analysis while identifying areas for future improvements in nuanced entity framing.

pdf bib abs
Atyaephyra at SemEval-2025 Task 4: Low-Rank Negative Preference Optimization
Jan Bronec | Jindřich Helcl

We present a submission to the SemEval 2025 shared task on unlearning sensitive content from LLMs. Our approach employs negative preference optimization using low-rank adaptation. We show that we can utilize this combination to cheaply compute additional regularization terms, which help with unlearning stabilization. The results of our approach significantly exceed the shared task baselines.

pdf bib abs
AILS-NTUA at SemEval-2025 Task 8: Language-to-Code prompting and Error Fixing for Tabular Question Answering
Andreas Evangelatos | George Filandrianos | Maria Lymperaiou | Athanasios Voulodimos | Giorgos Stamou

In this paper, we present our submission to SemEval-2025 Task 8: Question Answering over Tabular Data. This task, evaluated on the DataBench dataset, assesses Large Language Models’ (LLMs) ability to answer natural language questions over structured data while addressing topic diversity and table size limitations in previous benchmarks. We propose a system that employs effective LLM prompting to translate natural language queries into executable code, enabling accurate responses, error correction, and interpretability. Our approach ranks first in both subtasks of the competition in the proprietary model category, significantly outperforming the organizer’s baseline.

pdf bib abs
HalluSearch at SemEval-2025 Task 3: A Search-Enhanced RAG Pipeline for Hallucination Detection
Mohamed Abdallah | Samhaa El-Beltagy

We present HalluSearch, a multilingual pipeline designed to detect fabricated text spans in Large Language Model (LLM) outputs as part of Mu-SHROOM. HalluSearch couples retrieval-augmented verification with fine-grained factual splitting to identify and localize hallucinations in 14 different languages. Empirical evaluations show that HalluSearch performs competitively, placing fourth in both English (within the top 10%) and Czech. While the system’s retrieval-based strategy generally proves robust, it faces challenges in languages with limited online coverage, underscoring the need for further research to ensure consistent hallucination detection across diverse linguistic contexts.

pdf bib abs
COGNAC at SemEval-2025 Task 10: Multi-level Narrative Classification with Summarization and Hierarchical Prompting
Azwad Anjum Islam | Mark Finlayson

We present our approach to solving the Narrative Classification portion of the Multilingual Characterization and Extraction of Narratives SemEval-2025 challenge (Task 10, Subtask 2). This task is a multi-label, multi-class document classification task, where the classes were defined via natural language titles, descriptions, short examples, and annotator instructions, with only a few (and sometimes no) labeled examples for training. Our approach leverages text summarization, binary relevance with zero-shot prompts, and hierarchical prompting using Large Language Models (LLMs) to identify the narratives and subnarratives in the provided news articles. Notably, we did not use the labeled examples to train the system. Our approach clearly outperforms the official baseline, achieving an F1 score of 0.55 (narratives) and 0.43 (subnarratives), and placed 2nd in the test-set leaderboard at the system submission deadline. We provide an in-depth analysis of the construction and effectiveness of our approach using both open-source (LLaMA 3.1-8B-Instruct) and proprietary (GPT 4o-mini) Large Language Models under different prompting setups.

pdf bib abs
SyntaxMind at SemEval-2025 Task 11: BERT Base Multi-label Emotion Detection Using Gated Recurrent Unit
Md. Shihab Uddin Riad | Mohammad Aman Ullah

Emotions influence human behavior, speech, and expression, making their detection crucial in Natural Language Processing (NLP). While most prior research has focused on single-label emotion classification, real-world emotions are often multi-faceted. This paper describes our participation in SemEval-2025 Task 11, Track A (Multi-label Emotion Detection) and Track B (Emotion Intensity). We employed BERT as a feature extractor with stacked GRUs, which resulted in better stability and convergence. Our system was evaluated across 19 languages for Track A and 9 languages for Track B.

pdf bib abs
DEMON at SemEval-2025 Task 10: Fine-tuning LLaMA-3 for Multilingual Entity Framing
Matteo Fenu | Manuela Sanguinetti | Maurizio Atzori

This study introduces a methodology centred on Llama 3 fine-tuning for the classification of entities mentioned within news articles, based on a predefined role taxonomy. The research is conducted as part of SemEval-2025 Task 10, which focuses on the automatic identification of narratives, their classification, and the determination of the roles of the relevant entities involved. The developed system was specifically used within Subtask 1 on Entity Framing. The approach used is based on parameter-efficient fine-tuning, in order to minimize the computational costs while maintaining reasonably good model performance across all datasets and languages involved. The model achieved promising results on both the development and test sets. Specifically, during the final evaluation phase, it attained an average accuracy of 0.84 on the main role and an average Exact Match Ratio of 0.41 in the prediction of fine-grained roles across all five languages involved, i.e. Bulgarian, English, Hindi, Portuguese and Russian. The best performance was observed for English (3rd place out of 32 participants), on a par with Hindi and Russian. The paper provides an overview of the system adopted for the task and discusses the results obtained.

pdf bib abs
ITF-NLP at SemEval-2025 Task 11: An Exploration of English and German Multi-label Emotion Detection using Fine-tuned Transformer Models
Samantha Kent | Theresa Nindel

We present our submission to Task 11, Bridging the Gap in Text-Based Emotion Detection, of the 19th International Workshop on Semantic Evaluation (SemEval) 2025. We participated in track A, multi-label emotion detection, in both German and English. Our approach is based on fine-tuning transformer models for each language, and our models achieve a Macro F1 of 0.75 and 0.62 for English and German respectively. Furthermore, we analyze the data available for training to gain insight into the model predictions.

pdf bib abs
RaggedyFive at SemEval-2025 Task 3: Hallucination Span Detection Using Unverifiable Answer Detection
Wessel Heerema | Collin Krooneman | Simon Van Loon | Jelmer Top | Maurice Voors

Despite their broad utility, large language models (LLMs) are prone to hallucinations. Deviation from the provided source inputs, or disagreement with established facts, makes users question the reliability of LLMs. Hallucination detection systems for LLMs are therefore imperative. The system described in this paper detects hallucinated text spans by combining Retrieval-Augmented Generation (RAG) with Natural Language Inference (NLI). While zero-context handling of the RAG had little measurable effect, incorporating the RAG into a natural-language premise for the NLI yielded a noticeable improvement. Discrepancies can be attributed to labeling methodology and the implementation of the RAG.

pdf bib abs
JNLP at SemEval-2025 Task 1: Multimodal Idiomaticity Representation with Large Language Models
Blake Matheny | Phuong Nguyen | Minh Nguyen

Idioms and figurative language are nuanced linguistic phenomena that transport semanticity and culture via non-compositional multi-word expressions. This type of figurative language remains difficult for small and large language models to handle. Various attempts have been made to identify idiomaticity in text. The approach presented in this paper represents an intuitive attempt to accurately address Task 1: AdMIRe Subtask A to correctly order a series of images and captions by concatenating the image captions as a sequence. The methods employ the reliability of a pre-trained vision and language model for the image-type task and a large language model with instruction fine-tuning for a causal language model approach to handle the caption portion of the task. The results are informative for future iterations, but not comparably substantial.

Emotions play a fundamental role in the decision-making process, shaping human actions across diverse disciplines. The extensive usage of emotion intensity detection approaches has generated substantial research interest during the last few years. Efficient multi-label emotion intensity detection remains unsatisfactory even for high-resource languages, with a substantial performance gap between well-resourced and under-resourced languages. Team Tewodros participated in SemEval-2025 Task 11, Track B, focusing on detecting text-based emotion intensity. Our work involved multi-label emotion intensity detection across three languages: Amharic, English, and Spanish, using the afro-xlmr-large-76L, DeBERTa-v3-base, and BERT-base-Spanish-wwm-uncased models. The models achieved an average F1 score of 0.6503 for Amharic, 0.5943 for English, and an accuracy score of 0.6228 for Spanish. These results demonstrate the effectiveness of our models in capturing emotion intensity across multiple languages.

pdf bib abs
SheffieldGATE at SemEval-2025 Task 2: Multi-Stage Reasoning with Knowledge Fusion for Entity Translation
Xinye Yang | Kalina Bontcheva | Xingyi Song

This paper describes the machine translation system submitted to the SemEval-2025 Entity-Aware Machine Translation Task by the SheffieldGATE Team. We proposed a multi-agent entity-aware machine translation system that operates through three distinct reasoning stages: entity recognition, knowledge enhancement, and translation decision-making. The innovation in our approach lies in leveraging large language models to generate contextually relevant queries during the knowledge enhancement stage, extracting candidate entities and their translations from external knowledge bases. In the final translation decision-making stage, we employ fine-tuned large language models to denoise the retrieved knowledge, selecting the most relevant entity information to ensure accurate translation of the original text. Experimental results demonstrate our system’s effectiveness. In SemEval-2025 Task 2, our system ranks first among all systems in Spanish entity translation metrics and third in Italian. For systems that do not use gold standard entity IDs during test set inference, ours achieves the highest overall scores across four language pairs: German, French, Italian, and Spanish.

pdf bib abs
ITUNLP at SemEval-2025 Task 8: Question-Answering over Tabular Data: A Zero-Shot Approach using LLM-Driven Code Generation
Atakan Site | Emre Erdemir | Gülşen Eryiğit

This paper presents our system for SemEval-2025 Task 8: DataBench, Question-Answering over Tabular Data. The primary objective of this task is to perform question answering on given tabular datasets from diverse domains under two subtasks: DataBench QA (Subtask I) and DataBench Lite QA (Subtask II). To tackle both subtasks, we developed a zero-shot solution with a particular emphasis on leveraging Large Language Model (LLM)-based code generation. Specifically, we proposed a Python code generation framework, utilizing state-of-the-art open-source LLMs to generate executable Pandas code via optimized prompting strategies. Our experiments reveal that different LLMs exhibit varying levels of effectiveness in Python code generation. Additionally, results show that Python code generation achieves superior performance in tabular question answering compared to alternative approaches. Although our ranking among zero-shot systems is unknown at the time of this paper’s submission, our system achieved eighth place in Subtask I and sixth place in Subtask II among the 30 systems that outperformed the baseline in the open-source models category.

pdf bib abs
Fossils at SemEval-2025 Task 9: Tasting Loss Functions for Food Hazard Detection in Text Reports
Aman Sinha | Federica Gamba

Food hazard detection is an emerging field where NLP solutions are being explored. Despite the recent accessibility of powerful language models, one of the key challenges that still persists is the high class imbalance within datasets, often referred to in the literature as the long tail problem. In this work, we present a study exploring different loss functions borrowed from the field of visual recognition, to tackle long-tailed class imbalance for food hazard detection in text reports. Our submission to SemEval-2025 Task 9 on the Food Hazard Detection Challenge shows how re-weighting mechanisms in loss functions prove beneficial in class imbalance scenarios. In particular, we empirically show that class-balanced and focal loss functions outperform all other loss strategies for Subtask 1 and 2 respectively.
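The two re-weighting ideas named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the default values `beta=0.999` and `gamma=2.0`, and the normalization of the weights are assumptions chosen for illustration, following the commonly published forms of class-balanced weighting (inverse "effective number" of samples) and focal loss.

```python
import math

def class_balanced_weights(counts, beta=0.999):
    """Per-class weights proportional to (1 - beta) / (1 - beta**n).

    counts: number of training samples per class (each must be > 0).
    The weights are rescaled to sum to the number of classes (an assumed convention).
    """
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]

def focal_loss(probs, target, gamma=2.0, weights=None, eps=1e-12):
    """Focal loss for one example: -w * (1 - p_t)**gamma * log(p_t).

    probs: predicted class probabilities; target: index of the true class.
    With gamma = 0 and no weights this reduces to ordinary cross-entropy.
    """
    p_t = min(max(probs[target], eps), 1.0)  # clamp to avoid log(0)
    w = 1.0 if weights is None else weights[target]
    return -w * (1.0 - p_t) ** gamma * math.log(p_t)
```

In this sketch, rare classes receive larger weights, and confidently classified examples contribute less to the loss as `gamma` grows.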

pdf bib abs
Ustnlp16 at SemEval-2025 Task 9: Improving Model Performance through Imbalance Handling and Focal Loss
Zhuoang Cai | Zhenghao Li | Yang Liu | Liyuan Guo | Yangqiu Song

Classification tasks often suffer from imbalanced data distribution, which presents challenges in food hazard detection due to severe class imbalances, short and unstructured text, and overlapping semantic categories. In this paper, we present our system for SemEval-2025 Task 9: Food Hazard Detection, which addresses these issues by applying data augmentation techniques to improve classification performance. We utilize transformer-based models, BERT and RoBERTa, as backbone classifiers and explore various data balancing strategies, including random oversampling, Easy Data Augmentation (EDA), and focal loss. Our experiments show that EDA effectively mitigates class imbalance, leading to significant improvements in accuracy and F1 scores. Furthermore, combining focal loss with oversampling and EDA further enhances model robustness, particularly for hard-to-classify examples. These findings contribute to the development of more effective NLP-based classification models for food hazard detection.

pdf bib abs
Lacuna Inc. at SemEval-2025 Task 4: LoRA-Enhanced Influence-Based Unlearning for LLMs
Aleksey Kudelya | Alexander Shirnin

This paper describes LIBU (LoRA enhanced influence-based unlearning), an algorithm to solve the task of unlearning: removing specific knowledge from a large language model without retraining from scratch or compromising its overall utility (SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models). The algorithm combines classical influence functions to remove the influence of the data from the model and second-order optimization to stabilize the overall utility. Our experiments show that this lightweight approach is well suited to unlearning in LLMs across different kinds of tasks.

pdf bib abs
VerbaNexAI at SemEval-2025 Task 3: Fact Retrieval with Google Snippets for LLM Context Filtering to identify Hallucinations
Anderson Morillo | Edwin Puertas | Juan Carlos Martinez Santos

The first approach leverages advanced LLMs, employing a chain-of-thought prompting strategy with one-shot learning and Google snippets for context retrieval, demonstrating superior performance. The second approach utilizes traditional NLP analysis techniques, including semantic ranking, token-level extraction, and rigorous data cleaning, to identify hallucinations.

pdf bib abs
Team KiAmSo at SemEval-2025 Task 11: A Comparison of Classification Models for Multi-label Emotion Detection
Kimberly Sharp | Sofia Kathmann | Amelie Rüeck

The aim of this paper is to take on the challenge of multi-label emotion detection for a variety of languages as part of Track A in SemEval 2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. We fine-tune different pre-trained mono- and multilingual language models and compare their performance on multi-label emotion detection on a variety of high-resource and low-resource languages. Overall, we find that monolingual models tend to perform better, but for low-resource languages that do not have state-of-the-art pre-trained language models, multilingual models can achieve comparable results.

pdf bib abs
FiRC-NLP at SemEval-2025 Task 11: To Prompt or to Fine-Tune? Approaches for Multilingual Emotion Classification
Wondimagegnhue Tufa | Fadi Hassan | Evgenii Migaev | Yalei Fu

In this paper, we describe our system developed for participation in SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. We compare three approaches for multilingual, multi-label emotion classification: XLM-R, an ensemble of models (XLM-5), and a prompt-based approach. We evaluate the performance of these models across a diverse set of languages, ranging from high-resource to low-resource languages.

pdf bib abs
GIL-IIMAS UNAM at SemEval-2025 Task 4: LA-Min(E): LLM Unlearning Approaches Under Function Minimizing Evaluation Constraints
Karla Salas-Jimenez | Francisco López-Ponce | Diego Hernández-Bustamante | Gemma Bel-Enguix | Helena Gómez-Adorno

This paper describes Gradient Ascent and Task Vectors as LLM unlearning methodologies applied to SemEval 2025’s task 4. This task focuses on LLM unlearning on specific information under the constraints of preserving the model’s advanced text generation capabilities; meaning that our implementations of these algorithms were constrained both in the information datasets as well as the overall effect of each algorithm in the model’s general performance. Our implementation produced modified language models that ranked 7th out of 14 valid participants in the 7B parameter model, and 6th out of 24 in the 1B parameter model.

pdf bib abs
UncleLM at SemEval-2025 Task 11: RAG-Based Few-Shot Learning and Fine-Tuned Encoders for Multilingual Emotion Detection
Mobin Barfi | Sajjad Mehrpeyma | Nasser Mozayani

This paper presents our approach for SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. We investigate multiple methodologies, including fine-tuning transformer models and few-shot learning with GPT-4o-mini, incorporating Retrieval-Augmented Generation (RAG) for emotion intensity estimation. Our approach also leverages back-translation for data augmentation and threshold optimization to improve multi-label emotion classification. The experiments evaluate performance across multiple languages, including low-resource settings, with a focus on enhancing cross-lingual emotion detection.

pdf bib abs
UoB-NLP at SemEval-2025 Task 11: Leveraging Adapters for Multilingual and Cross-Lingual Emotion Detection
Frances Adriana Laureano De Leon | Yixiao Wang | Yue Feng | Mark Lee

Emotion detection in natural language processing is a challenging task due to the complexity of human emotions and linguistic diversity. While significant progress has been made in high-resource languages, emotion detection in low-resource languages remains underexplored. In this work, we address multilingual and cross-lingual emotion detection by leveraging adapter-based fine-tuning with multilingual pre-trained language models. Adapters introduce a small number of trainable parameters while keeping the pre-trained model weights fixed, offering a parameter-efficient approach to adaptation. We experiment with different adapter tuning strategies, including task-only adapters, target-language-ready task adapters, and language-family-based adapters. Our results show that target-language-ready task adapters achieve the best overall performance, particularly for low-resource African languages with our team ranking 7th for Tigrinya, and 8th for Kinyarwanda. In Track C, our system ranked 5th for Oromo, Tigrinya, Kinyarwanda, Amharic, and Igbo. Our approach outperforms large language models in 11 languages and matches their performance in four others, despite using significantly fewer parameters. Furthermore, we find that adapter-based models retain cross-linguistic transfer capabilities while requiring fewer computational resources compared to full fine-tuning for each language.

pdf bib abs
GIL-IIMAS UNAM at SemEval-2025 Task 3: MeSSI: A Multimodule System to detect hallucinated Segments in trivia-like Inquiries.
Francisco Lopez-Ponce | Karla Salas-Jimenez | Adrián Juárez-Pérez | Diego Hernández-Bustamente | Gemma Bel-Enguix | Helena Gomez-Adorno

We present MeSSI, a multi-module system applied to SemEval 2025’s task 3: Mu-SHROOM. Our system tags questions in order to obtain semantically relevant terms that are used as information retrieval characteristics. Said characteristics serve as extraction terms for Wikipedia pages that are in turn processed to generate gold standard texts used in a hallucination evaluation system. A PoST-based entity comparison was implemented to contrast the test dataset sentences with the corresponding generated gold standards, which in turn was the main criterion to tag hallucinations, partitioned in soft labels and hard labels. This method was tested in Spanish and English, finishing 18th and 19th respectively on the IoU based ranking.

pdf bib abs
Howard University-AI4PC at SemEval-2025 Task 10: Ensembling LLMs for Multi-lingual Multi-Label and Multi-Class Meta-Classification
Saurav Aryal | Prasun Dhungana

This paper describes our approach and submission to the SemEval 2025 shared task on “Multilingual Characterization and Extraction of Narratives from Online News”. The purpose of this task was to assign primary and fine-grained roles to named entities in news articles from five different languages, on the topics of Climate Change and Ukraine-Russia War. In this paper, we explain how we approached the task by utilizing multiple LLMs via Prompt Engineering and combining their results into a final task result through an ensemble meta-classification technique. Our experimental results demonstrate that this integrated approach outperforms the provided baseline in detecting bias, deception, and manipulation in news media across multiple languages.

pdf bib abs
HU at SemEval-2025 Task 9: Leveraging LLM-Based Data Augmentation for Class Imbalance
Muhammad Saad | Meesum Abbas | Sandesh Kumar | Abdul Samad

This paper presents a solution to the food hazard detection challenge in the SemEval-2025 Task 9, focusing on overcoming class imbalance using data augmentation techniques. We employ large language models (LLMs) like GPT-4o, Gemini Flash 1.5, and T5 to generate synthetic data, alongside other methods like synonym replacement, back-translation, and paraphrasing. These augmented datasets are used to fine-tune transformer-based models such as DistilBERT, improving their performance in detecting food hazards and categorizing products. Our approach achieves notable improvements in macro-F1 scores for both subtasks, although challenges remain in detecting implicit hazards and handling extreme class imbalance. The paper also discusses various techniques, including class weighting and ensemble modeling, as part of the training process. Despite the improvements, further work is necessary to refine hazard detection, particularly for rare and implicit categories.

pdf bib abs
FunghiFunghi at SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
Tariq Ballout | Pieter Jansma | Nander Koops | Yong Hui Zhou

Large Language Models (LLMs) often generate hallucinated content, which is factually incorrect or misleading, posing reliability challenges. The Mu-SHROOM shared task addresses hallucination detection in multilingual LLM-generated text. This study employs SpanBERT, a transformer model optimized for span-based predictions, to identify hallucinated spans across multiple languages. To address limited training data, we apply dataset augmentation through translation and synthetic generation. The model is evaluated using Intersection over Union (IoU) for span detection and Spearman’s correlation for ranking consistency. While the model detects hallucinated spans with moderate accuracy, it struggles with ranking confidence scores. These findings highlight the need for improved probability calibration and multilingual robustness. Future work should refine ranking methods and explore ensemble models for better performance.
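As an illustration of the IoU metric mentioned in this abstract, the following minimal sketch computes Intersection over Union between predicted and gold hallucination spans at the character level. The function name, the use of half-open `(start, end)` character offsets, and the convention of returning 1.0 when both sides are empty are assumptions for illustration; the shared task's official scorer may differ in these details.

```python
def span_iou(pred_spans, gold_spans):
    """Character-level IoU between two lists of (start, end) spans.

    Spans are half-open: (2, 5) covers character indices 2, 3, and 4.
    """
    pred = {i for start, end in pred_spans for i in range(start, end)}
    gold = {i for start, end in gold_spans for i in range(start, end)}
    if not pred and not gold:
        # Assumed convention: predicting nothing when nothing is annotated is a perfect match.
        return 1.0
    return len(pred & gold) / len(pred | gold)
```

For example, a prediction covering characters 0 to 3 against a gold span covering characters 2 to 5 shares two of six distinct characters.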

pdf bib abs
CIC-IPN at SemEval-2025 Task 11: Transformer-Based Approach to Multi-Class Emotion Detection
Tolulope Abiola | Olumide Ebenezer Ojo | Grigori Sidorov | Olga Kolesnikova | Hiram Calvo

This paper presents a multi-step approach for multi-label emotion classification as our system description paper for the SEMEVAL-2025 workshop Task A using machine learning and deep learning models. We test our methodology on English, Spanish, and low-resource Yoruba datasets, with each dataset labeled with five emotion categories: anger, fear, joy, sadness, and surprise. Our preprocessing involves text cleaning and feature extraction using bigrams and TF-IDF. We employ logistic regression for baseline classification and fine-tune Transformer models, such as BERT and XLM-RoBERTa, for improved performance. The Transformer-based models outperformed the logistic regression model, achieving micro-F1 scores of 0.7061, 0.7321, and 0.2825 for English, Spanish, and Yoruba, respectively. Notably, our Yoruba fine-tuned model outperformed the task organizers’ baseline model, which had a micro-F1 score of 0.092, demonstrating the effectiveness of Transformer models in handling emotion classification tasks across diverse languages.

pdf bib abs
Mr. Snuffleupagus at SemEval-2025 Task 4: Unlearning Factual Knowledge from LLMs Using Adaptive RMU
Arjun Dosajh | Mihika Sanghi

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, their tendency to memorize training data raises concerns regarding privacy, copyright compliance, and security, particularly in cases involving Personally Identifiable Information (PII). Effective machine unlearning techniques are essential to mitigate these risks, yet existing methods remain underdeveloped for LLMs due to their open-ended output space. In this work, we apply the Adaptive Representation Misdirection Unlearning (RMU) technique to unlearn sensitive information from LLMs. Through extensive experiments, we analyze the effects of unlearning across different decoder layers to determine the most effective regions for sensitive information removal. Our technique ranked 4th on the official leaderboard of both 1B parameter and 7B parameter models.

The focus of SemEval-2025 Task 7 is the retrieval of relevant fact-checks for social media posts across multiple languages. We approach this task with an enhanced bi-encoder retrieval setup, which is designed to match social media posts with relevant fact-checks using synthetic data from LLMs. We explored and analyzed two main approaches for generating synthetic posts: one based on existing fact-checks and one based on existing posts. Our approach achieved an S@10 score of 89.53% for the monolingual task and 74.48% for the crosslingual task, ranking 16th out of 28 and 13th out of 29, respectively. Without data augmentation, scores would have been 88.69% (17th) and 72.93% (15th).

pdf bib abs
NarrativeMiners at SemEval-2025 Task 10: Combating Manipulative Narratives in Online News
Muhammad Khubaib | Muhammad Shoaib Khursheed | Muminah Khurram | Abdul Samad | Sandesh Kumar

Our team, Narrative Miners, participated in SemEval-2025 Task 10 to tackle the challenge of detecting manipulative narratives in online news, focusing on the Ukraine-Russia war and climate change. We worked on three key subtasks: classifying entity roles, categorizing narratives and subnarratives, and generating concise narrative explanations. Using transformer-based models like BART, BERT, GPT-2, and Flan-T5, we implemented a structured pipeline and applied data augmentation to enhance performance. BART-CNN proved to be our best-performing model, significantly improving classification accuracy and explanation generation. Despite challenges like dataset limitations and class imbalance, our approach demonstrated the effectiveness of hierarchical classification and multilingual analysis in combating online disinformation. We made use of different data augmentation techniques to cover the class imbalances present in the dataset. We had different evaluation metrics set for each subtask, specifically focusing on the need of that particular outcome. With this paper, we hope to play our part in mitigating the impact of harmful disinformation.

pdf bib abs
Howard University-AI4PC at SemEval-2025 Task 11: Combining Expert Personas via Prompting for Enhanced Multilingual Emotion Analysis
Amir Ince | Saurav Aryal

For our approach to SemEval-2025 Task 11, we employ a multi-tier evaluation framework for perceived emotion analysis. Our system consists of a smaller-parameter-size large language model that independently predicts a given text’s perceived emotion while explaining the reasoning behind its decision. The initial model’s persona is varied through careful prompting, allowing it to represent multiple perspectives. These outputs, including both predictions and reasoning, are aggregated and fed into a final decision-making model that determines the ultimate emotion classification. We evaluated our approach in the official SemEval Task 11 on subtasks A and C in all the languages provided.

Emotion detection in text has emerged as a pivotal challenge in Natural Language Processing (NLP), particularly in multilingual and cross-lingual contexts. This paper presents our participation in SemEval 2025 Task 11, focusing on three subtasks: Multi-label Emotion Detection, Emotion Intensity Prediction, and Cross-lingual Emotion Detection. Leveraging state-of-the-art transformer models such as BERT and XLM-RoBERTa, we implemented baseline models and ensemble techniques to enhance predictive accuracy. Additionally, innovative approaches like data augmentation and translation-based cross-lingual emotion detection were used to address linguistic and class imbalances. Our results demonstrated significant improvements in F1 scores and Pearson correlations, showcasing the effectiveness of ensemble learning and transformer-based architectures in emotion recognition. This work advances the field by providing robust methods for emotion detection, particularly in low-resource and multilingual settings.

pdf bib abs
NLP-Cimat at SemEval-2025 Task 11: Prompt Optimization for LLMs via Genetic Algorithms and Systematic Mutation applied on Emotion Detection
Guillermo Segura-Gómez | Adrian Pastor Lopez Monroy | Fernando Sanchez-Vega | Alejandro Rosales Pérez

Large Language Models (LLMs) have shown remarkable performance across diverse natural language processing tasks in recent years. However, optimizing instructions to maximize model performance remains a challenge due to the vast search space and the nonlinear relationship between input structure and output quality. This work explores an alternative prompt optimization technique based on genetic algorithms with different structured mutation processes. Unlike traditional random mutations, our method introduces variability in each generation through a guided mutation, enhancing the likelihood of obtaining better prompts for each generation. We apply this approach to emotion detection in the context of SemEval 2025 Task 11, demonstrating the potential to improve prompt efficiency, and consequently task performance. Experimental results show that our method yields competitive results compared to standard optimization techniques while maintaining interpretability and scalability.

pdf bib abs
WC Team at SemEval-2025 Task 6: PromiseEval: Multinational, Multilingual, Multi-Industry Promise Verification leveraging monolingual and multilingual BERT models
Takumi Nishi | Nicole Miu Takagi

This paper presents our system developed for SemEval-2025 Task 6: PromiseEval: Multinational, Multilingual, Multi-Industry Promise Verification. The task aims at identifying “promises” made and “evidence” provided in company ESG statements for various languages. Our team participated in Subtasks 1 and 2 for the languages English, French, and Japanese. In this work, we propose using BERT and finetuning it to better address the task. We achieve competitive results, especially for English and Japanese.

Emotion intensity prediction in text enhances conversational AI by enabling a deeper understanding of nuanced human emotions, a crucial yet underexplored aspect of natural language processing (NLP). This study employs Transformer-based models to classify emotion intensity levels (0–3) for five emotions: anger, fear, joy, sadness, and surprise. The dataset, sourced from the SemEval shared task, was preprocessed to address class imbalance, and model training was performed using fine-tuned bert-base-uncased. Evaluation metrics showed that sadness achieved the highest accuracy (0.8017) and F1-macro (0.5916), while fear had the lowest accuracy (0.5690) despite a competitive F1-macro (0.5207). The results demonstrate the potential of Transformer-based models in emotion intensity prediction while highlighting the need for further improvements in class balancing and contextual representation.

pdf bib abs
NLP_CIMAT at SemEval-2025 Task 3: Just Ask GPT or look Inside. A prompt and Neural Networks Approach to Hallucination Detection
Jaime Stack-Sánchez | Miguel Alvarez-Carmona | Adrian Pastor Lopez Monroy

This paper presents NLP_CIMAT’s participation in SemEval-2025 Task 3, which focuses on hallucination detection in large language models (LLMs) at character level across multiple languages. Hallucinations, outputs that are coherent and well-formed but contain inaccurate or fabricated information, pose significant challenges in real-world NLP applications. We explore two primary approaches: (1) a prompt-based method that leverages LLMs’ own reasoning capabilities and knowledge, with and without external knowledge through a Retrieval-Augmented Generation (RAG)-like framework, and (2) a neural network approach that utilizes the hidden states of an LLM to predict hallucinated tokens. We analyze various factors in the neural approach, such as multilingual training, informing about the language, and hidden state selection. Our findings highlight that incorporating external information, like Wikipedia articles, improves hallucination detection, particularly for smaller LLMs. Moreover, our best prompt-based technique secured second place in the Spanish category, demonstrating the effectiveness of in-context learning for this task.

pdf bib abs
CSIRO LT at SemEval-2025 Task 8: Answering Questions over Tabular Data using LLMs
Tomas Turek | Shakila Mahjabin Tonni | Vincent Nguyen | Huichen Yang | Sarvnaz Karimi

Question Answering over large tables is challenging due to the difficulty of reasoning required in linking information from different parts of a table, such as headings and metadata, to the values in the table and information needs. We investigate using Large Language Models (LLMs) for tabular reasoning, where, given a pair of a table and a question from the DataBench benchmark, the models generate answers. We experiment with three techniques that enable symbolic reasoning through code execution: a direct code prompting (DCP) approach, ‘DCP_Py’, which uses Python; multi-step code (MSC) prompting, ‘MSC_SQL+FS’, which uses SQL; and ReAct prompting, ‘MSR_Py+FS’, which combines multi-step reasoning (MSR), few-shot (FS) learning, and Python tools. We also conduct an analysis exploring the impact of answer types, data size, and multi-column dependencies on LLMs’ answer generation performance, including an assessment of the models’ limitations and the underlying challenges of tabular reasoning in LLMs.

pdf bib abs
Howard University-AI4PC at SemEval-2025 Task 8: DeepTabCoder - Code-based Retrieval and In-context Learning for Question-Answering over Tabular Data
Saharsha Tiwari | Saurav Aryal

This paper presents our approach, named DeepTabCoder, to SemEval 2025 - Task 8: DataBench, which focuses on question-answering over tabular data. We utilize a code-based retrieval system combined with in-context learning, which generates and executes code to answer questions, leveraging DeepSeek-V3 for code generation. DeepTabCoder outperforms the baseline, achieving accuracies of 81.42% on the DataBench dataset and 80.46% on the DataBench Lite dataset.

We describe the methods used by our UAlberta team for the SemEval-2025 Task 2 on Entity-Aware Machine Translation (EA-MT). Our methods leverage large language models with prompt engineering strategies suited to this task, including retrieval augmented generation and in-context learning. Our best results overall are obtained with ensembles of multiple models, leveraging named entity knowledge in the dataset. Finally, we provide proof-of-concept experiments showing that lexico-semantic knowledge can be used to identify high-quality translations. We further demonstrate that our methods can function even without gold named entity translations, by using an alternative knowledge base such as BabelNet.

pdf bib abs
Oath Breakers at SemEval-2025 Task 06: PromiseEval
Muhammad Khubaib | Owais Aijaz | Ayesha Enayat

SemEval Task 6: Promise Eval, was designed to evaluate a company’s adherence to its ESG commitments. Using Natural Language Processing (NLP) and Deep Learning techniques, the task involves analyzing ESG reports to identify, classify, and verify corporate promises. The verification process follows a structured pipeline with four subtasks: Promise Classification, Evidence Verification, Evidence Classification, and Timeline Verification. These subtasks ensure that identified promises are well-defined, supported by credible evidence, and time-bound. For model implementation, BERT was initially used for most of the classification tasks but was later replaced with DeBERTa, which improved performance due to its superior contextual understanding. To enhance model generalization, contrastive learning was applied alongside standard classification loss, helping the model differentiate between positive and negative examples. Oversampling techniques were used to address class imbalance issues, particularly for the Misleading evidence category. For timeline verification, BART was chosen initially but then shifted to DeBERTa again, as it better captures sequential dependencies in text. The dataset consists of ESG reports containing labeled promise statements, evidence snippets, and timeline information. The data was preprocessed by tokenizing text, handling imbalanced classes through oversampling, and incorporating domain-specific embeddings to improve understanding. By implementing these techniques, the research aims to provide a transparent and accountable framework for assessing corporate promises, ensuring that companies are held accountable for their ESG commitments.

pdf bib abs
Team Cantharellus at SemEval-2025 Task 3: Hallucination Span Detection with Fine Tuning on Weakly Supervised Synthetic Data
Xinyuan Mo | Nikolay Vorontsov | Tiankai Zang

This paper describes our submission to SemEval-2025 Task-3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, which mainly aims at detecting spans of LLM-generated text corresponding to hallucinations in multilingual and multi-model context. We explored an approach of fine-tuning pretrained language models available on Hugging Face. The results show that predictions made by a pretrained model fine-tuned on synthetic data achieve a relatively high degree of alignment with human-generated labels. We participated in 13 out of 14 available languages and reached an average ranking of 10th out of 41 participating teams, with our highest ranking reaching the top 5 place.

This paper presents our findings of the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes, MU-SHROOM, which focuses on identifying hallucinations and related overgeneration errors in large language models (LLMs). The shared task involves detecting specific text spans that constitute hallucinations in the outputs generated by LLMs in 14 languages. To address this task, we aim to provide a nuanced, model-aware understanding of hallucination occurrences and severity in English. We used natural language inference and fine-tuned a ModernBERT model using a synthetic dataset of 400 samples, achieving an Intersection over Union (IoU) score of 0.032 and a correlation score of 0.422. These results indicate a moderately positive correlation between the model’s confidence scores and the actual presence of hallucinations. The IoU score indicates that our model has a relatively low overlap between the predicted hallucination span and the truth annotation. The performance is unsurprising, given the intricate nature of hallucination detection. Hallucinations often manifest subtly, relying on context, making pinpointing their exact boundaries formidable.

pdf bib abs
nlptuducd at SemEval-2025 Task 10: Narrative Classification as a Retrieval Task through Story Embeddings
Arjumand Younus | Muhammad Atif Qureshi

One of the most widely used elements in misinformation campaigns is media framing via certain angles, which in turn implies pitching news stories through a certain narrative. Narrative twisting to align with a political agenda includes complex dynamics involving different topics, patterns and rhetoric; there is however a certain coherence with respect to the media framing agenda that is to be promoted. The shared task’s objective is to develop models for classifying narratives in online news from a pre-defined two-level taxonomy (Subtask 2). In this paper, we discuss the application of a Mistral 7B model, specifically the E5 model, to address Subtask 2 in English, which is about finding the narrative taxonomy that a news article is trying to pitch. Our approach frames the task as a retrieval task in a similarity matching framework instead of relying on supervised learning. Our approach based on the use of a Mistral 7B model obtains an F1 on samples of 0.226 and is able to outperform the baseline provided for the competition.

pdf bib abs
MALTO at SemEval-2025 Task 4: Dual Teachers for Unlearning Sensitive Content in LLMs
Claudio Savelli | Evren Munis | Erfan Bayat | Andrea Grieco | Flavio Giobergia

Large language models (LLMs) may retain and reproduce sensitive information learned during training, posing significant privacy and ethical concerns. Once detected, this personal information should be deleted from the model. A naive answer could be to retrain these models from scratch when needed. However, this solution is unfeasible given the immense computational, economic, and environmental costs required to train these models. For this reason, Machine Unlearning (MU) has risen in recent years as an emerging field of research to efficiently delete specific information from a model’s knowledge. This paper presents our solution to the “Unlearning sensitive content from Large Language Models” shared task at SemEval-2025, which challenges researchers to develop effective LLM MU techniques. We adopt a Dual-Teacher framework that leverages a Competent and an Incompetent Teacher to erase unwanted information while selectively preserving model utility. Our approach adapts established computer vision unlearning methods to the sequential nature of language models through KL divergence minimization over next-token prediction probabilities. Our experimental results demonstrate that our method outperforms the state-of-the-art techniques.
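The dual-teacher idea described above can be illustrated with a minimal sketch over toy next-token distributions. All names here (`kl_divergence`, `dual_teacher_loss`), the direction of the KL term, and the use of a uniform distribution as the incompetent teacher are assumptions made for illustration; the abstract does not specify these details.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def dual_teacher_loss(student, competent, incompetent, is_forget):
    """Pull the student toward the incompetent teacher on forget samples
    and toward the competent teacher on retain samples."""
    teacher = incompetent if is_forget else competent
    return kl_divergence(teacher, student)


# Toy next-token distributions over a three-word vocabulary.
student = [0.7, 0.2, 0.1]
competent = [0.7, 0.2, 0.1]          # e.g. the original model
incompetent = [1 / 3, 1 / 3, 1 / 3]  # assumption: uniform "knows nothing" teacher

print(dual_teacher_loss(student, competent, incompetent, is_forget=False))
print(dual_teacher_loss(student, competent, incompetent, is_forget=True))
```

In this toy setting the retain loss is zero because the student already matches the competent teacher, while the forget loss is positive and would push the student toward an uninformative prediction.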

pdf bib abs
TueCL at SemEval-2025 Task 1: Image-Augmented Prompting and Multimodal Reasoning for Enhanced Idiom Understanding
Yue Yu | Jiarong Tang | Ruitong Liu

This paper presents our approach for SemEval-2025 Task 1, Advancing Multimodal Idiomaticity Representation (AdMIRe), which focuses on idiom image ranking via semantic similarity. We explore multiple strategies, including neural networks on extracted embeddings and Siamese networks with triplet loss. A key component of our methodology is the application of advanced prompt engineering techniques within multimodal in-context learning (ManyICL), leveraging GPT-4o and CLIP. Our experiments demonstrate that structured and optimized prompts significantly enhance the model’s ability to interpret idiomatic expressions in a multimodal setting.

pdf bib abs
FJWU_Squad at SemEval-2025 Task 1: An Idiom Visual Understanding Dataset for Idiom Learning
Maira Khatoon | Arooj Kiyani | Tehmina Farid | Sadaf Abdul Rauf

Idiomatic expressions pose difficulties for Natural Language Processing (NLP) because they are noncompositional. In this paper, we propose the Idiom Visual Understanding Dataset (IVUD), a multimodal dataset for idiom understanding using visual and textual representation. For SemEval-2025 Task 1 (AdMIRe), we specifically addressed dataset augmentation using AI-synthesized images and human-directed prompt engineering. We compared the efficacy of vision- and text-based models in ranking images aligned with idiomatic phrases. The results identify the advantages of using multimodal context for enhanced idiom understanding, showcasing how vision-language models perform better than text-only approaches in the detection of idiomaticity.

pdf bib abs
CLaC at SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification
Eeham Khan | Nawar Turk | Leila Kosseim

This paper presents our approach to the PromiseEval task at SemEval-2025, which focuses on verifying promises in corporate ESG (Environmental, Social, and Governance) reports. We explore three model architectures to address the four subtasks of promise identification, supporting evidence assessment, clarity evaluation, and verification timing. Our first model utilizes ESG-BERT with task-specific classifier heads, while our second model enhances this architecture with linguistic features tailored for each subtask. Our third approach implements a combined subtask model with attention-based sequence pooling, transformer representations augmented with document metadata, and multi-objective learning. Experiments on the English portion of the ML-Promise dataset demonstrate progressive improvement across our models, with our combined subtask approach achieving a private leaderboard score of 0.5268, outperforming the provided baseline of 0.5227. Our work highlights the effectiveness of linguistic feature extraction, attention pooling, and multi-objective learning in promise verification tasks, despite challenges posed by class imbalance and limited training data.

pdf bib abs
Howard University-AI4PC at SemEval-2025 Task 4: Unlearning Sensitive Content From Large Language Models Using Finetuning and Distillation for Selective Knowledge Removal
Aayush Acharya | Saurav Aryal

This paper presents our approach and submission to the SemEval 2025 task on “Unlearning Sensitive Content from Large Language Models.” The task focuses on making LLMs forget specific knowledge, such as copyrighted material and personally identifiable information (PII), without needing expensive retraining from scratch on the OLMo model. We propose a method to unlearn using fine-tuning and knowledge distillation. Our approach involves fine-tuning separate models on “retain” and “forget” datasets to preserve or suppress knowledge selectively. We then distill the model by suppressing logarithmic data from the fine-tuned model without learning using a combined loss of L2, KL divergence and cosine similarity while retaining knowledge from the fine-tuned model with retention using KL divergence loss.

pdf bib abs
Howard University-AI4PC at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval-Combining Zero-Shot Claim Extraction and KNN-Based Classification for Multilingual Claim Matching
Suprabhat Rijal | Saurav Aryal

SemEval Task 7 introduced a dataset for multilingual and cross-lingual fact checking. We propose a system that leverages similarity matching, KNN, zero-shot classification and summarization to retrieve fact-checks for social media posts across multiple languages. Our approach achieves performance within the expected range, aligning with baseline results. Although competitive, the findings highlight the potential and challenges of zero-shot methods, providing a foundation for future research in cross-lingual information verification.

pdf bib abs
McGill-NLP at SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection
Vivek Verma | David Ifeoluwa Adelani

In this paper, we present the results of our SemEval-2025 Emotion Detection Shared Task Track A, which focuses on multi-label emotion detection. Our team’s approach leverages prompting GPT-4o, fine-tuning an NLLB-LLM2Vec encoder, and an ensemble of these two approaches to solve Track A. Our ensemble method beats the baseline method that fine-tuned a RemBERT encoder in 24 of the 28 languages. Furthermore, our results show that the average performance is much worse for under-resourced languages in the Afro-Asiatic, Niger-Congo and Austronesian families, with performance scores at 50 F1 points and below.

pdf bib abs
Howard University - AI4PC at SemEval-2025 Task 3: Logit-based Supervised Token Classification for Multilingual Hallucination Span Identification Using XGBOD
Saurav Aryal | Mildness Akomoize

This paper describes our system for SemEval-2025 Task 3, Mu-SHROOM, which focuses on detecting hallucination spans in multilingual LLM outputs. We reframe hallucination detection as a point-wise anomaly detection problem by treating logits as time-series data. Our approach extracts features from token-level logits, addresses class imbalance with SMOTE, and trains an XGBOD model for probabilistic character-level predictions. Our system, which relies solely on information derived from the logits and token offsets (using pretrained tokenizers), achieves competitive intersection-over-union (IoU) and correlation scores on the validation and test set.

pdf bib abs
NTA at SemEval-2025 Task 11: Enhanced Multilingual Textual Multi-label Emotion Detection via Integrated Augmentation Learning
Nguyen Pham Hoang Le | An Nguyen Tran Khuong | Tram Nguyen Thi Ngoc | Thin Dang Van

Emotion detection in text is crucial for various applications, but progress, especially in multi-label scenarios, is often hampered by data scarcity, particularly for low-resource languages like Emakhuwa and Tigrinya. This lack of data limits model performance and generalizability. To address this, the NTA team developed a system for SemEval-2025 Task 11, leveraging data augmentation techniques: swap, deletion, oversampling, emotion-focused synonym insertion and synonym replacement to enhance baseline models for multilingual textual multi-label emotion detection. Our proposed system achieved significantly higher macro F1-scores compared to the baseline across multiple languages.

pdf bib abs
Wikidata-Driven Entity-Aware Translation: Boosting LLMs with External Knowledge
Lu Xu

This paper presents an entity-aware machine translation system that significantly improves named entity translation by integrating external knowledge from Wikidata with Large Language Models (LLMs). While LLMs demonstrate strong general translation capabilities, they struggle with named entities that require specific cultural or domain knowledge. We address this challenge through two approaches: retrieving multilingual entity representations using gold Wikidata IDs, and employing Relik, an information extraction tool, to automatically detect and link entities without gold annotations. Experiments across multiple language pairs show our system outperforms baselines by up to 63 percentage points in entity translation accuracy (m-ETA) while maintaining high overall translation quality. Our approach ranked 3rd overall and 1st among non-finetuned systems on the SemEval-2025 Task 2 leaderboard. Additionally, the language-specific post-processing we introduced further enhances performance, particularly for Traditional Chinese translations.

pdf bib abs
Swushroomsia at SemEval-2025 Task 3: Probing LLMs’ Collective Intelligence for Multilingual Hallucination Detection
Sandra Mitrović | Joseph Cornelius | David Kletz | Ljiljana Dolamic | Fabio Rinaldi

This paper introduces a system designed for SemEval-2025 Task 3: Mu-SHROOM, which focuses on detecting hallucinations in multilingual outputs generated by large language models (LLMs). Our approach leverages the collective intelligence of multiple LLMs by prompting several models with three distinct prompts to annotate hallucinations. These individual annotations are then merged to create a comprehensive probabilistic annotation. The proposed system demonstrates strong performance, achieving high accuracy in span detection and strong correlation between predicted probabilities and ground truth annotations.
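As an illustration of how several span annotations could be merged into a probabilistic one, the sketch below averages binary character-level votes. The averaging rule and the name `merge_annotations` are assumptions; the abstract does not state which merging rule the system uses.

```python
def merge_annotations(text_length, annotations):
    """Average binary character-level votes from several annotators.

    `annotations` holds one list of (start, end) spans (end exclusive) per
    annotator, e.g. one per LLM-and-prompt combination.
    """
    votes = [0] * text_length
    for spans in annotations:
        # Count each character at most once per annotator.
        marked = {i for start, end in spans for i in range(start, min(end, text_length))}
        for i in marked:
            votes[i] += 1
    return [v / len(annotations) for v in votes]


# Two annotators disagree on the exact boundaries of one hallucinated span.
print(merge_annotations(4, [[(0, 2)], [(1, 3)]]))
```

Characters marked by every annotator receive probability 1.0, while characters on disputed boundaries receive intermediate values.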

The paper presents our system developed for SemEval-2025 Task 8, which focuses on table question answering (TQA). The TQA tasks face challenges due to the characteristics of real-world tabular data, such as large size, incomplete column semantics, and entity ambiguity. To address these issues, we propose a large language model (LLM)-powered and programming-based framework, named Flow-of-Table-Reasoning. We introduce the table schema integrating verbalized structure and semantics for query decomposition and programming, enabling a holistic understanding of tables and the ability to process large-size tables. We design a multi-step schema linking plan to derive a focused table schema that retains only information relevant to the query, aiming to eliminate ambiguity and reduce hallucinations. Furthermore, we incorporate reasoning workflow into an iterative thinking architecture, allowing incremental cycles of thinking, reasoning and reflection. Our system achieves first place on both TQA and Lite TQA subtasks.

pdf bib abs
Howard University-AI4PC at SemEval-2025 Task 1: Using GPT-4o and CLIP-ViLT to Decode Figurative Language Across Text and Images
Saurav Aryal | Lawal Abdulmujeeb

Correctly identifying idiomatic expressions remains a major challenge in Natural Language Processing (NLP), as these expressions often have meanings that cannot be directly inferred from their individual words. The SemEval-2025 Task 1 introduces two subtasks, A and B, designed to test models’ ability to interpret idioms using multimodal data, including both text and images. This paper focuses on Subtask A, where the goal is to determine which among several images best represents the intended meaning of an idiomatic expression in a given sentence. To address this, we employed a two-stage approach. First, we used GPT-4o to analyze sentences, extracting relevant keywords and sentiments to better understand the idiomatic usage. This processed information was then passed to a CLIP-VIT model, which ranked the available images based on their relevance to the idiomatic expression. Our results showed that this approach performed significantly better than directly feeding sentences and idiomatic compounds into the models without preprocessing. Specifically, our method achieved a Top-1 accuracy of 0.67 in English, whereas performance in Portuguese was notably lower at 0.23. These findings highlight both the promise of multimodal approaches for idiom interpretation and the challenges posed by language-specific differences in model performance.

pdf bib abs
CSECU-DSG at SemEval-2025 Task 6: Exploiting Multilingual Feature Fusion-based Approach for Corporate Promise Verification
Tashin Hossain | Abu Nowshed Chy

In SemEval-2025, we participated in the multilingual corporate promise verification task. In the task, we mainly focused on the promise and evidence identification task, and illustrated the performance for the five different languages. For all the languages, we proposed a unified state-of-the-art framework to classify the target labels. For the framework, we incorporated the pre-feature fusion approach, then integrated it with the neural network architecture. Additionally, in the dataset description and discussion section, we provide different insights into our findings through visualization of the dataset structures and explainability of the model’s performance.

pdf bib abs
Exploration Lab IITK at SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection
Tafazzul Nadeem | Riyansha Singh | Suyamoon Pathak | Ashutosh Modi

This paper presents our approach to SemEval-2025 Task 11 (Track A): Bridging the Gap in Text-Based Emotion Detection, with a focus on multi-label emotion classification for the English dataset. Our methodology leverages an ensemble of transformer-based models, incorporating full fine-tuning along with additional classification layers to enhance predictive performance. Through extensive experimentation, we demonstrate that fine-tuning significantly improves emotion classification accuracy compared to baseline models. Furthermore, we provide an in-depth analysis of the dataset, highlighting key patterns and challenges. The study also evaluates the impact of ensemble modeling on performance, demonstrating its effectiveness in capturing nuanced emotional expressions. Finally, we outline potential directions for further refinement and domain-specific adaptations to enhance model robustness.

pdf bib abs
Stanford MLab at SemEval-2025 Task 11: Track B–Emotion Intensity Detection
Joseph Le | Hannah Cui | James Zhang

We outline our SemEval 2025 Track B: Emotion Intensity Prediction submission, for which the objective is to predict the intensity of six primary emotions (anger, disgust, fear, joy, sadness, and surprise) between 0 and 3, with 0 being none and 3 being very strong. We used a regression fine-tuned BERT-based model that makes use of pretrained embeddings in order to sense subtle emotional wordings in text. We include tokenization with a BERT tokenizer, training with AdamW optimization, and an ExponentialLR scheduler used for learning rate modification. Performance is monitored based on validation loss and accuracy through closeness of model outputs to gold labels. Our best-performing model is 68.97% accurate in validation and has a validation loss of 0.373, demonstrating BERT’s capability in fine-grained emotion intensity prediction. Key findings include that fine-tuning transformer models with regression loss improves prediction accuracy and that early stopping and learning rate scheduling avoid overfitting. Future improvements can include larger datasets, ensemble models, or other architectures such as RoBERTa and T5. This paper shows the potential of pretrained transformers for emotion intensity estimation and lays the groundwork for future computational emotion analysis research.

pdf bib abs
Team Anotheroption at SemEval-2025 Task 8: Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA
Nikolas Evkarpidi | Elena Tutubalina

This paper presents a system developed for SemEval 2025 Task 8: Question Answering (QA) over tabular data. Our approach integrates several key components: text-to-SQL and text-to-Code generation modules, a self-correction mechanism, and a retrieval-augmented generation (RAG). Additionally, it includes an end-to-end (E2E) module, all orchestrated by a large language model (LLM). Through ablation studies, we analyzed the effects of different parts of our pipeline and identified the challenges that are still present in this field. During the evaluation phase of the competition, our solution achieved an accuracy of 80%, resulting in a top-13 ranking among the 38 participating teams. Our pipeline demonstrates a significant improvement in accuracy for open-source models and achieves a performance comparable to proprietary LLMs in QA tasks over tables.

pdf bib abs
Howard University-AI4PC at SemEval-2025 Task 2: Improving Machine Translation With Context-Aware Entity-Only Pre-translations with GPT4o
Saurav Aryal | Jabez Agyemang - Prempeh

This paper presents our work on a 3-Step GPT translation system developed for SemEval-2025 Task 2 to enhance the translation of named entities within machine translation. Our approach integrates (1) entity extraction via Wikidata, (2) GPT-based refinement of entity translations, and (3) final context-aware GPT translation. Results from the original dataset of six languages show significant improvements in the handling of named entities compared to direct GPT-based translation baselines. We further discuss replicability and observed challenges, and outline future research directions.

pdf bib abs
Zero_Shot at SemEval-2025 Task 11: Fine-Tuning Deep Learning and Transformer-based Models for Emotion Detection in Multi-label Classification, Intensity Estimation, and Cross-lingual Adaptation
Ashraful Islam Paran | Sabik Aftahee | Md. Refaj Hossan | Jawad Hossain | Mohammed Moshiul Hoque

Language is a rich medium employed to convey emotions subtly and intricately, as abundant as human emotional experiences themselves. Emotion recognition in natural language processing (NLP) is now a core element in facilitating human-computer interaction and interpreting intricate human behavior via text. It has potential applications in every sector, e.g., sentiment analysis and mental health surveillance. However, prior research on emotion recognition is primarily from high-resource languages while low-resource languages (LRLs) are not well represented. This disparity has been a limitation to the development of universally applicable emotion detection models. To address this, the SemEval-2025 Shared Task 11 focused on perceived emotions, aiming to identify the emotions conveyed by a text snippet. It includes three tracks: Multi-label Emotion Detection (Track A), Emotion Intensity (Track B), and Cross-lingual Emotion Detection (Track C). This paper explores various models, including machine learning (LR, SVM, RF, NB), deep learning (BiLSTM+CNN, BiLSTM+BiGRU), and transformer-based models (XLM-R, mBERT, ModernBERT). The results showed that XLM-R outperformed other models in Tracks A and B, while BiLSTM+CNN performed better for Track C across most languages.

pdf bib abs
YNU-HPCC at SemEval-2025 Task 6: Using BERT Model with R-drop for Promise Verification
Dehui Deng | You Zhang | Jin Wang | Dan Xu | Xuejie Zhang

This paper presents our participation in SemEval-2025 Task 6: multinational, multilingual, multi-industry promise verification. SemEval-2025 Task 6 aims to extract Promise Identification, Supporting Evidence, Clarity of the Promise-Evidence Pair, and Timing for Verification from the commitments made to businesses and governments. These data are then used to verify whether companies and governments have fulfilled their commitments. In this task, we participated in the English task, which included analysis of numbers in the text, reading comprehension of the text content and multi-label classification. Our model introduces regularization dropout based on BERT-base to focus on the stability of non-target classes, improve the robustness of the model, and ultimately improve the indicators. Our approach obtained competitive results in subtasks.

pdf bib abs
PATeam at SemEval-2025 Task 9: LLM-Augmented Fusion for AI-Driven Food Safety Hazard Detection
Xue Wan | Fengping Su | Ling Sun | Yuyang Lin | Pengfei Chen

This paper introduces the approach we adopted for the SemEval-2025 “Food Hazard Detection” task, which aims to predict coarse-grained categories (such as “product category” and “hazard category”) and fine-grained vectors (such as specific products like “ice cream” or hazards like “salmonella”) from noisy, long-tailed text data. To address the issues of dirty data, as well as the severe long-tail distribution of text labels and length in the data, we proposed a pipeline system. This system combines data cleaning, LLM-based enhancement, label resampling, and ensemble learning to tackle data sparsity and label imbalance problems. The two subtasks have strong semantic relatedness. By integrating them into a unified multiturn dialogue framework, we fine-tuned five models using a bagging approach. Ultimately, we achieved good results in both subtasks, ranking 5th (with an F1 score of 80.17% for ST1 and 52.66% for ST2).

pdf bib abs
Howard University-AI4PC at SemEval-2025 Task 9: Using Open-weight BART-MNLI for Zero Shot Classification of Food Recall Documents
Saurav Aryal | Kritika Pant

We present our system for SemEval-2025 Task 9: Food Hazard Detection, a shared task focused on the explainable classification of food-incident reports. The task involves predicting hazard and product categories (ST1) and their exact vectors (ST2) from short texts. Our approach leverages zero-shot classification using the BART-large-MNLI model, which allows classification without task-specific fine-tuning. Our model achieves competitive performance, emphasizing hazard prediction accuracy, as evaluated by the macro-F1 score.

pdf bib abs
Fane at SemEval-2025 Task 10: Zero-Shot Entity Framing with Large Language Models
Enfa Fane | Mihai Surdeanu | Eduardo Blanco | Steven Corman

Understanding how news narratives frame entities is crucial for studying media’s impact on societal perceptions of events. In this paper, we evaluate the zero-shot capabilities of large language models (LLMs) in classifying framing roles. Through systematic experimentation, we assess the effects of input context, prompting strategies, and task decomposition. Our findings show that a hierarchical approach of first identifying broad roles and then fine-grained roles, outperforms single-step classification. We also demonstrate that optimal input contexts and prompts vary across task levels, highlighting the need for subtask-specific strategies. We achieve a Main Role Accuracy of 89.4% and an Exact Match Ratio of 34.5%, demonstrating the effectiveness of our approach. Our findings emphasize the importance of tailored prompt design and input context optimization for improving LLM performance in entity framing.

pdf bib
CSCU at SemEval-2025 Task 6: Enhancing Promise Verification with Paraphrase and Synthesis Augmentation: Effects on Model Performance
Kittiphat Leesombatwathana | Wisarut Tangtemjit | Dittaya Wanvarie

pdf bib abs
UB_Tel-U at SemEval-2025 Task 11: Emotions Without Borders - A Unified Framework for Multilingual Classification Using Augmentation and Ensemble
Tirana Noor Fatyanosa | Putra Pandu Adikara | Rochmanu Erfitra | Muhammad Dikna | Sari Dewi Budiwati | Cahyana Cahyana

In this SemEval 2025 Task 11 paper, we tackled three tracks: Multi-label Emotion Detection, Emotion Intensity, and Cross-lingual Emotion Detection. Our approach harnesses diverse external corpora and robust data augmentation techniques across Spanish, English, and Arabic, enhancing both the diversity and resilience of the dataset. Instead of developing separate models for each language, we merge the data into a unified multilingual dataset, enabling our model to learn cross-lingual patterns and relationships simultaneously. Our ensemble architecture integrates the multilingual strengths of XLM-RoBERTa, a zero-shot classification capability via LLaMA 3, and a specialized pretrained model fine-tuned on English emotion classification. Notably, our system achieved strong performance, ranking 13th for Afrikaans (afr) in Track A, 13th for Amharic (amh) in Track B, and 4th for Hindi (hin) in Track C.

pdf bib abs
Deep at SemEval-2025 Task 11: A Multi-Stage Approach to Emotion Detection
Dong Shenpo

This paper presents a novel text-based emotion detection approach for low-resource languages in SemEval-2025 Task 11. We fine-tuned Google Gemma 2 using tailored data augmentation and Chain-of-Thought prompting. Our method, incorporating supervised fine-tuning and model ensembling, significantly improved multi-label emotion recognition, intensity prediction, and cross-lingual performance. Results show strong performance in diverse low-resource settings. Challenges remain in fine-grained sentiment analysis. Future work will explore advanced data augmentation and knowledge transfer methods. This research demonstrates the potential of large language models for inclusive emotion analysis.

In today’s era of abundant online news, tackling the spread of deceptive content and manipulative narratives has become crucial. This paper details our system for SemEval-2025 Task 10, focusing on Subtasks 1 (Entity Framing) and 3 (Narrative Extraction). We instruct-tuned a quantized version of Microsoft’s Phi-4 model, incorporating prompt engineering techniques to enhance performance. Our approach involved experimenting with various LLMs, including LLaMA, Phi-4, RoBERTa, and XLM-R, utilizing both quantized large models and non-quantized small models. To improve accuracy, we employed structured prompts, iterative refinement with retry mechanisms, and integrated label taxonomy information. For Subtask 1, we also fine-tuned a RoBERTa classifier to predict main entity roles before classifying the fine-grained roles with Phi-4 for the English language. For Subtask 3, we instruct-tuned Phi-4 to generate structured explanations, incorporating details about the article and its dominant narrative. Our system achieves competitive results in Hindi and Russian for Subtask 1.

pdf bib abs
TreeSearch at SemEval-2025 Task 8: Monte Carlo Tree Search for Question-Answering over Tabular Data
Aakarsh Nair | Huixin Yang

Large Language Models (LLMs) can answer diverse questions but often generate factually incorrect responses. SemEval-2025 Task 8 focuses on table-based question-answering, providing 65 real-world tabular datasets and 1,300 questions that require precise filtering and summarization of underlying tables. We approach this problem as a neuro-symbolic code generation task, translating natural language queries into executable Python code to ensure contextually relevant and factually accurate answers. We formulate LLM decoding as a Markov Decision Process, enabling Monte Carlo Tree Search (MCTS) as a lookahead-based planning algorithm while decoding from the underlying code-generating LLM, instead of standard beam search. Execution success on synthetic tests and real datasets serves as a reward signal, allowing MCTS to explore multiple code-generation paths, validate outcomes, assign value to partial solutions, and refine code iteratively rather than merely maximizing sequence likelihood in a single step. Our approach improves accuracy by 2.38x compared to standard decoding.

Hallucinations pose a significant challenge for large language models when answering knowledge-intensive queries. As LLMs become more widely adopted, it is crucial not only to detect if hallucinations occur but also to pinpoint where they arise. SemEval 2025 Task 3, Mu-SHROOM: Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, is a recent effort in this direction. This paper describes our solution to the shared task. We propose a framework that first retrieves relevant context, next identifies false content from the answer, and finally maps them back to spans. The process is further enhanced by automatically optimizing prompts. Our system achieves the highest overall performance, ranking #1 in average position across all languages.

pdf bib abs
YNU-HPCC at SemEval-2025 Task 10: A Two-Stage Approach to Solving Multi-Label and Multi-Class Role Classification Based on DeBERTa
Ning Li | You Zhang | Jin Wang | Dan Xu | Xuejie Zhang

A two-stage role classification model based on DeBERTa is proposed for the Entity Framing task in SemEval 2025 Task 10. The task is confronted with challenges such as multi-labeling, multi-category, and category imbalance, particularly in the semantic overlap and data sparsity of fine-grained roles. Existing methods primarily rely on rules, traditional machine learning, or deep learning, but the accurate classification of fine-grained roles is difficult to achieve. To address this, the proposed model integrates the deep semantic representation of the DeBERTa pre-trained language model through two sub-models: main role classification and sub-role classification, and utilizes Focal Loss to optimize the category imbalance issue. Experimental results indicate that the model achieves an accuracy of 75.32% in predicting the main role, while the exact matching rate for the sub-role is 8.94%. This is mainly limited by the strict matching standard and semantic overlap of fine-grained roles in the multi-label task. Compared to the baseline’s sub-role exact matching rate of 3.83%, the proposed model significantly improves this metric. The model ultimately ranked 23rd on the leaderboard. The code of this paper is available at: https://github.com/jiyuaner/YNU-HPCC-at-SemEval-2025-Task10.
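For readers unfamiliar with Focal Loss, the following is a minimal sketch of its standard binary form for a single prediction. The defaults `gamma=2.0` and `alpha=0.25` are commonly used values and are assumptions here; the abstract does not report the settings used in the submitted system.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one predicted probability p (0 < p < 1) and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor down-weights examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction contributes far less than an uncertain one.
print(focal_loss(0.9, 1), focal_loss(0.6, 1))
```

With `gamma=0` the expression reduces to a class-weighted cross-entropy, which is why the modulating factor is what helps rare, hard classes dominate the gradient.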

pdf bib abs
Empaths at SemEval-2025 Task 11: Retrieval-Augmented Approach to Perceived Emotions Prediction
Lev Morozov | Aleksandr Mogilevskii | Alexander Shirnin

The paper introduces EmoRAG, a retrieval-augmented emotion detection system designed for the SemEval-2025 Task 11. It uses an ensemble of models, retrieving similar examples to prompt large language models (LLMs) for emotion predictions. The retriever component fetches the most relevant examples from a database, which are then used as few-shot prompts for the models. EmoRAG achieves strong, scalable performance across languages with no training at all, demonstrating effectiveness in both high and low-resource languages.
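The retrieve-then-prompt idea can be sketched as follows. This is not the EmoRAG implementation: the word-overlap (Jaccard) retriever is a simple stand-in for whatever retriever the system uses, and the names `jaccard`, `build_few_shot_prompt`, and the prompt layout are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two texts (stand-in retriever score)."""
    a_words, b_words = set(a.lower().split()), set(b.lower().split())
    union = a_words | b_words
    return len(a_words & b_words) / len(union) if union else 0.0


def build_few_shot_prompt(query, database, k=2):
    """Retrieve the k most similar labelled examples and format them as a few-shot prompt."""
    ranked = sorted(database, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    blocks = [f"Text: {text}\nEmotions: {', '.join(labels)}" for text, labels in ranked[:k]]
    blocks.append(f"Text: {query}\nEmotions:")  # the LLM completes this last line
    return "\n\n".join(blocks)


database = [
    ("I am so happy today", ["joy"]),
    ("This is scary", ["fear"]),
    ("I am so angry today", ["anger"]),
]
print(build_few_shot_prompt("I am so happy right now", database))
```

Because the labelled examples are only placed in the prompt, adding a new language amounts to adding examples to the database rather than retraining a model.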

pdf bib abs
IUST_Champs at SemEval-2025 Task 8: Structured Prompting and Retry Policy for Tabular Question Answering
Arshia Hossein Zadeh | Aysa Mayahinia | Nafiseh Ahmadi

This paper presents a novel approach to Question Answering over Tabular Data, as part of SemEval-2025 Task 8. Our system generates executable Python code to derive answers directly from structured data, leveraging open-source large language models. Key innovations include structured prompting, semantic column filtering, and a one-time retry mechanism to enhance accuracy and robustness. We evaluate our approach on the DataBench and DataBench_Lite datasets, significantly outperforming the baseline accuracy (26-27%) with our best system achieving 70.49% accuracy on the test set. Ablation studies confirm that few-shot prompting and rule-based type classification are crucial for improved performance. Despite these advancements, challenges remain in handling complex table structures and ambiguous queries. Our findings highlight the effectiveness of code-generation based methods for tabular question answering and provide insights for further research in this area.
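
The one-time retry mechanism described above might look roughly like the following sketch. The `generate` callable is a hypothetical stand-in for the LLM call, and the convention that generated code defines an `answer(table)` function is an assumption made for this example, not something stated in the abstract.

```python
def run_generated_code(code, table):
    """Execute generated code that is expected to define answer(table)."""
    namespace = {}
    exec(code, namespace)  # NOTE: untrusted code should be sandboxed in practice
    return namespace["answer"](table)

def answer_with_retry(question, table, generate):
    """Try once; on failure, regenerate once with the error as feedback.

    `generate(question, error)` is a hypothetical stand-in for an LLM call.
    """
    code = generate(question, None)
    try:
        return run_generated_code(code, table)
    except Exception as err:
        # Single retry: pass the error message back to the generator.
        retry_code = generate(question, repr(err))
        return run_generated_code(retry_code, table)

def fake_generate(question, error):
    """Toy generator: the first attempt is buggy, the retry is correct."""
    if error is None:
        return "def answer(table):\n    return table['missing'][0]"
    return "def answer(table):\n    return max(table['age'])"

table = {"age": [31, 45, 27]}
result = answer_with_retry("What is the maximum age?", table, fake_generate)
```

Here the first attempt raises a `KeyError`, and the retry succeeds. A second failure is allowed to propagate, which matches the "one-time" policy.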

pdf bib abs
HausaNLP at SemEval-2025 Task 11: Advancing Hausa Text-based Emotion Detection
Sani Abdullahi Sani | Salim Abubakar | Falalu Ibrahim Lawan | Abdulhamid Abubakar | Maryam Bala

This paper presents our approach to multi-label emotion detection in Hausa, a low-resource African language, as part of SemEval Track A. We fine-tuned AfriBERTa, a transformer-based model pre-trained on African languages, to classify Hausa text into six emotions: anger, disgust, fear, joy, sadness, and surprise. Our methodology involved data preprocessing, tokenization, and model fine-tuning using the Hugging Face Trainer API. The system achieved a validation accuracy of 74.00%, with an F1-score of 73.50%, demonstrating the effectiveness of transformer-based models for emotion detection in low-resource languages.

pdf bib abs
indiDataMiner at SemEval-2025 Task 11: From Text to Emotion: Transformer-Based Models for Emotions Detection in Indian Languages
Saurabh Kumar | Sujit Kumar | Sanasam Ranbir Singh | Sukumar Nandi

Emotion detection is essential for applications like mental health monitoring and social media analysis, yet remains underexplored for Indian languages. This paper presents our system for SemEval-2025 Task 11 (Track A), focusing on multilabel emotion detection in Hindi and Marathi, two widely spoken Indian languages. We fine-tune IndicBERT v2 on the BRIGHTER dataset, achieving F1 scores of 87.37 (Hindi) and 88.32 (Marathi), outperforming baseline models. Our results highlight the effectiveness of fine-tuning a language-specific pretrained model for emotion detection, contributing to advancements in multilingual NLP research.

pdf bib abs
GinGer at SemEval-2025 Task 11: Leveraging Fine-Tuned Transformer Models and LoRA for Sentiment Analysis in Low-Resource Languages
Aylin Naebzadeh | Fatemeh Askari

Emotion recognition is a crucial task in natural language processing, particularly in the domain of multi-label emotion classification, where a single text can express multiple emotions with varying intensities. In this work, we participated in Task 11, Track A and Track B of the SemEval-2025 competition, focusing on emotion detection in low-resource languages. Our approach leverages transformer-based models combined with parameter-efficient fine-tuning (PEFT) techniques to effectively address the challenges posed by data scarcity. We specifically applied our method to multiple languages and achieved 9th place in the Arabic Algerian track among 40 competing teams. Our results demonstrate the effectiveness of PEFT in improving emotion recognition performance for low-resource languages. The code for our implementation is publicly available at: https://github.com/AylinNaebzadeh/Text-Based-Emotion-Detection-SemEval-2025.

pdf bib abs
YNU at SemEval-2025 Task 4: Synthetic Token Alternative Training for LLM Unlearning
Yang Chen | Zheyang Luo | Zhiwen Tang

This paper describes our system submitted to SemEval-2025 Task 4, which introduces the Synthetic Token Alternative Training (STAT) algorithm for efficient unlearning in large language models (LLMs). The proposed method aims to enable pretrained models to selectively forget designated data (the forget set) while preserving performance on the remaining data (the retain set). The STAT framework adopts a dual-stage process. In the first stage, pseudo tokens are generated through random sampling and applied to the forget set, facilitating more effective targeted unlearning. In the second stage, the model undergoes gradient-based optimization using an alternative training scheme that alternates between pseudo-token-augmented samples from the forget set and unmodified samples from the retain set. This design promotes stable unlearning of the specified data while accelerating convergence and preserving the model’s general performance. Our system achieved 3rd place in the 7B model track (OLMo-7B) and 7th place in the 1B model track (OLMo-1B), demonstrating substantial improvements over the official baselines, exhibiting superior stability in knowledge retention and more effective targeted forgetting compared to existing approaches.

pdf bib abs
TILeN at SemEval-2025 Task 11: A Transformer-Based Model for Sentiment Classification Applied to the Russian Language
Jorge Reyes-Magaña | Luis Basto-Díaz | Luis Fernando Curi-Quintal

We present our approach to the Sentiment Classification Task. The task was divided into 3 categories: 1) Track A: Multi-label Emotion Detection, 2) Track B: Emotion Intensity, and 3) Cross-lingual Emotion Detection. We participate in subtasks 1 and 2 for the Russian language. Our main approach uses pre-trained language models, which we then fine-tune on the corpora provided. During the development phase, we had promising outcomes. Later, during the test phase, we obtained scores similar to the SemEval baseline. Our approach is easy to replicate, and we provide every detail of the process performed.

pdf bib abs
UCSC at SemEval-2025 Task 8: Question Answering over Tabular Data
Neng Wan | Sicong Huang | Esha Ubale | Ian Lane

Table question answering (Table QA) remains challenging due to the varied structures of tables and the complexity of queries, which often require specialized reasoning. We introduce a system that leverages large language models (LLMs) to generate executable code as an intermediate step for answering questions on tabular data. The methodology uniformly represents tables as dataframes and prompts an LLM to translate natural-language questions into code that can be executed on these tables. This approach addresses key challenges by handling diverse table formats, enhancing interpretability through code execution. Experimental results on the DataBench benchmarks demonstrate that the proposed code-then-execute approach achieves high accuracy. Moreover, by offloading computation to code execution, the system requires fewer LLM invocations, thereby improving efficiency. These findings highlight the effectiveness of an LLM-based coding approach for reliable, scalable, and interpretable Table QA.

pdf bib abs
JU-CSE-NLP’25 at SemEval-2025 Task 4: Learning to Unlearn LLMs
Arkajyoti Naskar | Dipankar Das | Sivaji Bandyopadhyay

Large Language Models (LLMs) have achieved enormous success recently due to their ability to understand and solve various non-trivial tasks in natural language. However, they have been shown to memorize their training data which, among other concerns, increases the risk of the model regurgitating creative or private content, potentially leading to legal issues for the model developer and/or vendors. Such issues are often discovered post-model training during testing or red teaming. While unlearning has been studied for some time in classification problems, it is still a relatively underdeveloped area of study in LLM research since the latter operates in a potentially unbounded output label space. Specifically, robust evaluation frameworks are lacking to assess the accuracy of these unlearning strategies. In this challenge, we aim to bridge this gap by developing a comprehensive evaluation challenge for unlearning sensitive datasets in LLMs.

pdf bib abs
pingan-team at SemEval-2025 Task 2: LoRA-Augmented Qwen2.5 with Wikidata-Driven Entity Translation
Diyang Chen

This paper presents our solution for SemEval-2025 Task 2 on entity-aware machine translation. We propose a parameter-efficient adaptation framework using Low-Rank Adaptation (LoRA) to fine-tune the Qwen2.5-72B model, enabling effective knowledge transfer while preserving generalization capabilities. To address data scarcity and entity ambiguity, we design a Wiki-driven augmentation pipeline that leverages Wikidata’s multilingual entity mappings to generate synthetic training pairs. Our system achieves state-of-the-art performance across 10 languages, securing first place in the competition. Experimental results demonstrate significant improvements in both translation quality (COMET) and entity accuracy (M-ETA).

pdf bib abs
PoliTo at SemEval-2025 Task 1: Beyond Literal Meaning: A Chain-of-Though Approach for Multimodal Idiomacity Understanding
Lorenzo Vaiani | Davide Napolitano | Luca Cagliero

Idiomatic expressions present significant challenges for natural language understanding systems as their meanings often diverge from the literal interpretation. While prior works have focused on textual idiom detection, the role of visual content in reasoning about idiomaticity remains underexplored. This study introduces a Chain-of-Thought reasoning framework that enhances idiomatic comprehension by ranking images based on their relevance to a compound expression in context, requiring the system to distinguish between idiomatic and literal meanings. We comprehensively evaluate our approach by quantitatively analyzing the performance improvements achieved by integrating textual and visual information in the ranking process through different prompting settings. Our empirical findings provide insights into the capabilities of visual Large Language Models to establish meaningful correlations between idiomatic content and its visual counterpart, suggesting promising directions for multimodal language understanding.

pdf bib abs
YNU-HPCC at SemEval-2025 Task 1: Enhancing Multimodal Idiomaticity Representation via LoRA and Hybrid Loss Optimization
Liu Lei | You Zhang | Jin Wang | Dan Xu | Xuejie Zhang

This study reports the YNU-HPCC team’s participation in Subtask A of SemEval-2025 Task 1 on multimodal idiomatic representation. The task requires ranking candidate images based on their semantic relevance to a target idiom within a given sentence, challenging models to disambiguate idiomatic semantics, and aligning them with abstract visual concepts across English and Portuguese. Using AltCLIP-m18 as the base model, our approach enhances its zero-shot capabilities with LoRA fine-tuning and combines ListMLE ranking optimization with Focal Loss to handle hard samples. Experimental results on the primary test set show significant improvements over the base model, with Top-1 Accuracy/DCG scores of 0.53/2.94 for English and 0.77/3.31 for Portuguese. The code is publicly available at https://github.com/1579364808/Semeval_2025_task1.
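
For readers unfamiliar with ListMLE, the following is a minimal plain-Python sketch of the standard formulation: the negative log-likelihood of the gold ordering under a Plackett-Luce model. It shows only the ranking term; how the authors weight it against Focal Loss is not stated in the abstract, so that combination is left out.

```python
import math

def listmle_loss(scores, relevance):
    """ListMLE: negative log-likelihood of the gold ordering under Plackett-Luce.

    scores    -- model scores, one per candidate image
    relevance -- gold relevance values (higher = should rank earlier)
    """
    # Reorder the scores into the gold ranking, best candidate first.
    order = sorted(range(len(scores)), key=lambda i: -relevance[i])
    ranked = [scores[i] for i in order]
    loss = 0.0
    for i in range(len(ranked)):
        tail = ranked[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_sum_exp - ranked[i]
    return loss
```

The loss is lower when the scores agree with the gold ordering, so `listmle_loss([3, 2, 1], [3, 2, 1])` is smaller than `listmle_loss([1, 2, 3], [3, 2, 1])`.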

pdf bib abs
JU_NLP at SemEval-2025 Task 7: Leveraging Transformer-Based Models for Multilingual & Crosslingual Fact-Checked Claim Retrieval
Atanu Nayak | Srijani Debnath | Arpan Majumdar | Pritam Pal | Dipankar Das

Fact-checkers are often hampered by the sheer amount of online content that needs to be fact-checked. NLP can help them by retrieving already existing fact-checks relevant to the content being investigated. This paper presents a systematic approach for the retrieval of top-k relevant fact-checks for a given post in a monolingual and cross-lingual setup using transformer-based pre-trained models fine-tuned with a dual encoder architecture. By training and evaluating the shared task test dataset, our proposed best-performing framework achieved an average success@10 score of 0.79 and 0.62 for the retrieval of 10 fact-checks from the fact-check corpus against a post in monolingual and crosslingual track respectively.
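
The success@10 metric reported above can be sketched as follows, assuming the usual definition: a post counts as a success if at least one relevant fact-check appears among the top k retrieved items. Function and variable names are illustrative.

```python
def success_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1.0 if any relevant fact-check appears in the top k, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(relevant_ids) else 0.0

def mean_success_at_k(runs, k=10):
    """Average success@k over (ranked_ids, relevant_ids) pairs, one per post."""
    return sum(success_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)

# Toy data: the first post has a relevant hit at rank 2, the second has none.
runs = [(["a", "b"], ["b"]), (["c", "d"], ["x"])]
score = mean_success_at_k(runs, k=10)
```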

pdf bib abs
JellyK at SemEval-2025 Task 11: Russian Multi-label Emotion Detection with Pre-trained BERT-based Language Models
Khoa Le | Dang Thin

This paper presents our approach for SemEval-2025 Task 11, where we focus on multi-label emotion detection in Russian text (Track A). We preprocess the data by handling special characters, punctuation, and emotive expressions to improve feature-label relationships. To select the best-performing model, we fine-tune various pre-trained language models specialized in Russian and evaluate them using k-fold cross-validation. Our results indicated that ruRoberta-large achieved the best Macro F1-score among tested models. Finally, our system achieved fifth place in the unofficial competition ranking.
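
The k-fold evaluation mentioned above can be illustrated with a small index-splitting helper. This is a generic sketch, not the authors' code: the value of k is an assumption, and in practice the data would typically be shuffled (or stratified by label) before splitting.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample each.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(k_fold_indices(10, k=3))
```

Each candidate model would be fine-tuned on every `train` split and scored on the matching `val` split, with the per-fold Macro F1 averaged to compare models.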

This paper presents a system description for the SemEval Mu-SHROOM task, focusing on detecting hallucination spans in the outputs of instruction-tuned Large Language Models (LLMs) across 14 languages. We compare two distinct approaches: Prompt-Based Approach (PBA), which leverages the capability of LLMs to detect hallucination spans using different prompting strategies, and the Fine-Tuning-Based Approach (FBA), which fine-tunes pre-trained Language Models (LMs) to extract hallucination spans in a supervised manner. Our experiments reveal that PBA, especially when incorporating explicit references or external knowledge, outperforms FBA. However, the effectiveness of PBA varies across languages, likely due to differences in language representation within LLMs.

pdf bib abs
UCSC NLP T6 at SemEval-2025 Task 1: Leveraging LLMs and VLMs for Idiomatic Understanding
Judith Clymo | Adam Zernik | Shubham Gaur

Idiomatic expressions pose a significant challenge for natural language models due to their non-compositional nature. In this work, we address Subtask 1 of the SemEval-2025 Task 1 (ADMIRE), which requires distinguishing between idiomatic and literal usages of phrases and identifying images that align with the relevant meaning. Our approach integrates large language models (LLMs) and vision-language models, and we show how different prompting techniques improve those models’ ability to identify and explain the meaning of idiomatic language.

pdf bib abs
JUNLP_Sarika at SemEval-2025 Task 11: Bridging Contextual Gaps in Text-Based Emotion Detection using Transformer Models
Sarika Khatun | Dipanjan Saha | Dipankar Das

Because language is subjective, it can be difficult to infer human emotions from textual data. This work investigates the categorization of emotions using BERT, classifying five emotions—angry, fearful, joyful, sad, and surprised—by utilizing its contextual embeddings. Preprocessing techniques like tokenization and stop-word removal are used on the dataset, which comes from social media and personal tales. With a weighted F1-score of 0.75, our model was trained using a multi-label classification strategy. BERT has the lowest F1-score when it comes to rage, but it does well when it comes to identifying fear and surprise. The findings demonstrate the difficulties presented by unbalanced datasets while also highlighting the promise of transformer-based models for text-based emotion identification. Future research will use data augmentation methods, domain-adapted BERT models, and other methods to improve classification performance.

pdf bib abs
PATeam at SemEval-2025 Task 10: Two-stage News Analytical Framework: Target-oriented Semantic Segmentation and Sequence Generation LLMs for Cross-Lingual Entity and Narrative Analysis
Ling Sun | Xue Wan | Yuyang Lin | Fengping Su | Pengfei Chen

This paper presents our approaches for three subtasks in SemEval-2025 Task 10, which focus on entity framing, narrative classification, and narrative extraction in news analysis, respectively. We propose a two-stage news analytical framework for both Subtasks A and B. In Subtask A (Entity Framing), we design an entity-oriented data processing pipeline to address the issue of redundant information in a news article, and explore effective use of multilingual datasets through sufficient experiments. The system achieves first place in Bulgarian and second place in English and Portuguese. In Subtask B (Narrative Classification), a similar narrative-oriented data processing pipeline is adopted to obtain condensed news chunks for each narrative. We conduct in-depth discussion regarding approaches to enhancing both data quality and volume, and explore one-vs-rest classification models and sequence prediction models for multi-label classification tasks. The system ranks first in Bulgarian and second in Russian and Portuguese. In Subtask C (Narrative Extraction), we build our system with data augmentation, supervised fine-tuning, and preference-based reinforcement learning. This system achieves first place in Bulgarian, Russian and Hindi and second place in Portuguese.

pdf bib abs
YNUzwt at SemEval-2025 Task 10: Tree-guided Stagewise Classifier for Entity Framing and Narrative Classification
Qiangyu Tan | Yuhang Cui | Zhiwen Tang

This paper presents a hierarchical classification framework, designated as the Tree-guided Stagewise Classifier (TGSC), which implements a Chain-of-Thought (CoT) reasoning paradigm for addressing multi-label and multi-class classification challenges in multilingual news article analysis in SemEval-2025 Task 10. The proposed methodology leverages the zero-shot capabilities inherent in Large Language Models (LLMs) through a systematic hierarchical reasoning mechanism. This process proceeds through successive hierarchical levels, wherein the classification commences from root nodes and progressively navigates category branches via iterative determinations at each hierarchical tier, ultimately culminating in leaf category identification during the final classification stage. To optimize classification precision, a specialized prompt engineering strategy incorporating hierarchical structural constraints is developed to guide the reasoning trajectory. Experimental results demonstrate the effectiveness of our approach, achieving competitive performance across multiple languages in Subtask 1 and Subtask 2.
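
The root-to-leaf traversal described above can be sketched as a simple loop over a label tree. This is a simplified single-path illustration under stated assumptions: `choose` is a hypothetical stand-in for the prompted LLM, the toy taxonomy is invented, and the fallback to the first child is an arbitrary choice for the example. The multi-label case would follow several branches instead of one.

```python
def classify_stagewise(text, tree, choose, root="ROOT"):
    """Walk a label tree from the root to a leaf, one decision per level.

    tree   -- dict mapping each internal node to its list of child labels
    choose -- hypothetical callable (text, options) -> one option; in the
              paper's setting this would be a prompted LLM
    """
    node, path = root, []
    while node in tree:                 # internal node: descend one level
        options = tree[node]
        picked = choose(text, options)
        if picked not in options:       # constrain the answer to valid children
            picked = options[0]         # arbitrary fallback for this sketch
        path.append(picked)
        node = picked
    return path

# Toy taxonomy with placeholder labels (not the task's real label set).
toy_tree = {"ROOT": ["A", "B"], "A": ["A1", "A2"], "B": ["B1", "B2"]}
path = classify_stagewise("some article", toy_tree, lambda text, opts: opts[-1])
```

Restricting each decision to the children of the current node is one way to realize the "hierarchical structural constraints" the abstract mentions, since the model never sees labels from other branches.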

pdf bib abs
PromotionGo at SemEval-2025 Task 11: A Feature-Centric Framework for Cross-Lingual Multi-Emotion Detection in Short Texts
Ziyi Huang | Xia Cui

This paper presents our system for SemEval 2025 Task 11: Bridging the Gap in Text-Based Emotion Detection (Track A), which focuses on multi-label emotion detection in short texts. We propose a feature-centric framework that dynamically adapts document representations and learning algorithms to optimize language-specific performance. Our study evaluates three key components: document representation, dimensionality reduction, and model training in 28 languages, highlighting five for detailed analysis. The results show that TF-IDF remains highly effective for low-resource languages, while contextual embeddings like FastText and Contextual String Embeddings (CSEs) exhibit language-specific strengths. Principal Component Analysis (PCA) reduces training time without compromising performance, particularly benefiting FastText and neural models such as Multi-Layer Perceptrons (MLP). Computational efficiency analysis underscores the trade-off between model complexity and processing cost. Our framework provides a scalable solution for multilingual emotion detection, addressing the challenges of linguistic diversity and resource constraints.

This paper presents our approach for SemEval 2025 Task 11 Track A, focusing on multilabel emotion classification across 28 languages. We explore two main strategies: fully fine-tuning transformer models and classifier-only training, evaluating different settings such as fine-tuning strategies, model architectures, loss functions, encoders, and classifiers. Our findings suggest that training a classifier on top of prompt-based encoders such as mE5 and BGE yields significantly better results than fully fine-tuning XLMR and mBERT. Our best-performing model on the final leaderboard is an ensemble combining multiple BGE models, where CatBoost serves as the classifier, with different configurations. This ensemble achieves an average F1-macro score of 56.58 across all languages.

pdf bib abs
Exploration Lab IITK at SemEval-2025 Task 8: Multi-LLM Agent QA over Tabular Data
Aditya Bangar | Ankur Kumar | Shlok Mishra | Ashutosh Modi

This paper presents our Multi-LLM Agentic System that helps solve the problem of tabular question answering as posed in the SemEval Task-8: Question Answering over Tabular Data. Our system incorporates an Agentic Workflow where we assign each agent a role along with the context from other agents to better help resolve the ambiguity. As the user poses their question along with the dataframe, we firstly try to infer the types of the columns from the dataframe and also the expected answer type given the question and the column types. Then, the planner agent gives out a plan that tells us about the steps that we have to take to get the answer. Each step is written such that it helps us write one line of Python code. Then, we call the coding agent, which attempts to write the code given the information from the previous agents. After that, we perform a debugging pass through a debugging agent, which tries to resolve the issue given the previous context and finally deliver the answer if the code runs error-free. Our system achieved 14th place on the overall open-source models track.

pdf bib abs
COGUMELO at SemEval-2025 Task 3: A Synthetic Approach to Detecting Hallucinations in Language Models based on Named Entity Recognition
Aldan Creo | Héctor Cerezo-Costas | Maximiliano Hormazábal Lagos | Pedro Alonso Doval

In this paper, we propose an approach to detecting hallucinations based on a Named Entity Recognition (NER) task. We focus on efficiency, aiming to develop a model that can detect hallucinations without relying on external data sources or expensive computations that involve state-of-the-art large language models with upwards of tens of billions of parameters. We utilize the SQuAD question answering dataset to generate a synthetic version that contains both correct and hallucinated responses and train encoder language models of a moderate size (RoBERTa and FLAN-T5) to predict spans of text that are highly likely to contain a hallucination. We test our models on a separate dataset of expert-annotated question-answer pairs and find that our approach achieves a Jaccard similarity of up to 0.358 and 0.227 Spearman correlation, which suggests that our models can serve as moderately accurate hallucination detectors, ideally as part of a detection pipeline involving human supervision. We also observe that larger models seem to develop an emergent ability to leverage their background knowledge to make more informed decisions, while smaller models seem to take shortcuts that can lead to a higher number of false positives. We make our data and code publicly accessible, along with an online visualizer. We also release our trained models under an open license.
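
A character-level Jaccard similarity between predicted and gold spans, of the kind reported above, can be computed as in the sketch below. The half-open `(start, end)` span convention and the choice to score two empty predictions as 1.0 are assumptions made for this example.

```python
def span_chars(spans):
    """Expand [start, end) character spans into a set of character offsets."""
    return {i for start, end in spans for i in range(start, end)}

def span_jaccard(predicted, gold):
    """Intersection-over-union of predicted and gold hallucination spans."""
    pred, ref = span_chars(predicted), span_chars(gold)
    if not pred and not ref:
        return 1.0  # assumption: both empty counts as perfect agreement
    return len(pred & ref) / len(pred | ref)

# Predicted characters 0-3 and gold characters 2-5 share two offsets out of six.
overlap = span_jaccard([(0, 4)], [(2, 6)])
```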

pdf bib abs
CDHF at SemEval-2025 Task 9: A Multi-Task Learning Approach for Food Hazard Classification
Phuoc Chu

We present our system in SemEval-2025 Task 9: Food Hazard Detection. Our approach focuses on multi-label classification of food recall titles into predefined hazard and product categories. We fine-tune pre-trained transformer models, comparing BERT and BART. Our results show that BART significantly outperforms BERT, achieving an F1-score of 0.8033 during development. However, in the final evaluation phase, our system obtained an F1-score of 0.7676, ranking 54th in Subtask 1. While our performance is not among the top, our findings highlight the importance of model choice in food hazard classification. Future work can explore additional improvements, such as ensemble methods and domain adaptation.

pdf bib abs
UT-NLP at SemEval-2025 Task 11: Evaluating Zero-Shot Capability of GPT-4o mini on Emotion Recognition via Role-Play and Contrastive Judging
Amirhossein Safdarian | Milad Mohammadi | Heshaam Faili

Emotion recognition in text is crucial in natural language processing but challenging in multilingual settings due to varying cultural and linguistic cues. In this study, we assess the zero-shot capability of GPT-4o Mini, a cost-efficient small-scale LLM, for multilingual emotion detection. Since small LLMs tend to perform better with task decomposition, we introduce a two-step approach: (1) Role-Play Rewriting, where the model minimally rewrites the input sentence to reflect different emotional tones, and (2) Contrastive Judging, where the original sentence is compared against these rewrites to determine the most suitable emotion label. Our approach requires no labeled data for fine-tuning or few-shot in-context learning, enabling a plug-and-play solution that can seamlessly integrate with any LLM. Results show promising performance, particularly in low-resource languages, though with a performance gap between high- and low-resource settings. These findings highlight how task decomposition techniques like role-play and contrastive judging can enhance small LLMs’ zero-shot capabilities for real-world, data-scarce scenarios.

The proliferation of multilingual misinformation demands robust systems for crosslingual fact-checked claim retrieval. This paper addresses SemEval-2025 Shared Task 7, which challenges participants to retrieve fact-checks for social media posts across 14 languages, even when posts and fact-checks are in different languages. We propose a hybrid retrieval pipeline that combines sparse lexical matching (BM25, BGE-m3) and dense semantic retrieval (pretrained and fine-tuned BGE-m3) with dynamic fusion and curriculum-trained rerankers. Our system achieves 67.2% crosslingual and 86.01% monolingual accuracy on the Shared Task MultiClaim dataset.
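
The abstract does not say how its "dynamic fusion" combines the sparse and dense result lists. As a stand-in, the sketch below uses reciprocal rank fusion (RRF), a common and simple fusion rule; it is not claimed to be the authors' method, and the constant `k=60` is the conventional default rather than a value from the paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

sparse_hits = ["a", "b", "c"]   # e.g. a lexical retriever's ranking
dense_hits = ["b", "c", "a"]    # e.g. a dense retriever's ranking
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
```

A fused list like this would then be passed to a reranker, which is the stage the abstract describes as curriculum-trained.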

pdf bib abs
ScottyPoseidon at SemEval-2025 Task 8: LLM-Driven Code Generation for Zero-Shot Question Answering on Tabular Data
Raghav R | Adarsh Prakash Vemali | Darpan Aswal | Rahul Ramesh | Ayush Bhupal

Tabular Question Answering (QA) is crucial for enabling automated reasoning over structured data, facilitating efficient information retrieval and decision-making across domains like finance, healthcare, and scientific research. This paper describes our system for the SemEval 2025 Task 8 on Question Answering over Tabular Data, specifically focusing on the DataBench QA and DataBench Lite QA subtasks. Our approach involves generating Python code using Large Language Models (LLMs) to extract answers from tabular data in a zero-shot setting. We investigate both multi-step Chain-of-Thought (CoT) and unified LLM approaches, where the latter demonstrates superior performance by minimizing error propagation and enhancing system stability. Our system prioritizes computational efficiency and scalability by minimizing the input data provided to the LLM, optimizing its ability to contextualize information effectively. We achieve this by sampling a minimal set of rows from the dataset and utilizing external execution with Python and Pandas to maintain efficiency. Our system achieved the highest accuracy amongst all small open-source models, ranking 1st in both subtasks.

pdf bib abs
TECHSSN at SemEval-2025 Task 10: A Comparative Analysis of Transformer Models for Dominant Narrative-Based News Summarization
Pooja Premnath | Venkatasai Ojus Yenumulapalli | Parthiban Mohankumar | Rajalakshmi Sivanaiah | Angel Deborah S

This paper presents an approach to Task 10 of SemEval 2025, which focuses on summarizing English news articles using a given dominant narrative. The dataset comprises news articles on the Russia-Ukraine war and climate change, introducing challenges related to bias, information compression, and contextual coherence. Transformer-based models, specifically BART variants, are utilized to generate concise and coherent summaries. Our team, TechSSN, achieved 4th place on the official test leaderboard with a BERTScore of 0.74203, employing the DistilBART-CNN-12-6 model.

pdf bib abs
MINDS at SemEval-2025 Task 9: Multi-Task Transformers for Food Hazard Coarse-Fine Classification
Flavio Giobergia

Food safety is a critical concern: hazardous incident reports need to be classified to be able to take appropriate measures in a timely manner. The SemEval-2025 Task 9 on Food Hazard Detection aims to classify food-related incident reports by identifying both the type of hazard and the product involved, at both coarse and fine levels of granularity. In this paper, we present our solution that approaches the problem by leveraging two independent encoder-only transformer models, each fine-tuned separately to classify hazards and food products, at the two levels of granularity of interest. Experimental results show that our approach effectively addresses the classification task, achieving high-quality performance on both subtasks. We additionally include a discussion on potential improvements for future iterations, and a brief description of failed attempts. We make the code available at https://github.com/fgiobergia/SemEval2025-Task9.

pdf bib abs
MINDS at SemEval-2025 Task 8: Question Answering Over Tabular Data via Large Language Model-generated SQL Queries
Flavio Giobergia

The growing capabilities of Large Language Models (LLMs) have opened up new opportunities for answering questions based on structured data. However, LLMs often struggle to directly handle tabular data and provide accurate, grounded answers. This paper addresses the challenge of Question Answering (QA) over tabular data, specifically in the context of SemEval-2025 Task 8. We propose an LLM-based pipeline that generates SQL queries to extract answers from tabular datasets. Our system leverages In-Context Learning to produce queries, which are then executed on structured tables, to produce the final answers. We demonstrate that our solution performs effectively in a few-shot setup and scales well across tables of different sizes. Additionally, we conduct a data-driven error analysis to highlight scenarios where the model encounters difficulties. We make the code available at https://github.com/fgiobergia/SemEval2025-Task8.
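
The "generate SQL, then execute it on the table" step described above can be sketched with Python's built-in `sqlite3` module. The abstract does not name the SQL engine used, so SQLite, the table name `data`, and the hard-coded query (standing in for LLM output) are all assumptions for illustration.

```python
import sqlite3

def answer_with_sql(rows, columns, query):
    """Load rows into an in-memory SQLite table named `data` and run a query."""
    conn = sqlite3.connect(":memory:")
    col_defs = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f"CREATE TABLE data ({col_defs})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO data VALUES ({placeholders})", rows)
    result = conn.execute(query).fetchall()
    conn.close()
    return result

rows = [("Alice", 31), ("Bob", 45), ("Cara", 27)]
# In the pipeline the query would come from the LLM; here it is hard-coded.
generated_query = 'SELECT "name" FROM data ORDER BY "age" DESC LIMIT 1'
oldest = answer_with_sql(rows, ["name", "age"], generated_query)
```

Because the answer comes from executing the query rather than from the model's free-text output, it stays grounded in the table contents, which is the motivation the abstract gives for the approach.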

This paper presents AfroEmo, a multilingual, multi-label emotion classification system designed for SemEval 2025 Task 11, leveraging the Afro XLMR model. Our approach integrates adaptive pretraining on domain-specific corpora followed by fine-tuning on low-resource languages. Through comprehensive exploratory data analysis, we assess label distribution and model performance across diverse linguistic settings. By incorporating perceived emotions, how emotions are interpreted rather than explicitly stated, we enhance emotion recognition capabilities in underrepresented languages. Experimental results demonstrate that our method achieves competitive performance, particularly in Amharic, while addressing key challenges in low-resource emotion detection.

pdf bib abs
bbStar at SemEval-2025 Task 10: Improving Narrative Classification and Explanation via Fine Tuned Language Models
Rishit Tyagi | Rahul Bouri | Mohit Gupta

Understanding covert narratives and implicit messaging is essential for analyzing bias and sentiment. Traditional NLP methods struggle with detecting subtle phrasing and hidden agendas. This study tackles two key challenges: (1) multi-label classification of narratives and sub-narratives in news articles, and (2) generating concise, evidence-based explanations for dominant narratives. We fine-tune a BERT model with a recall-oriented approach for comprehensive narrative detection, refining predictions using a GPT-4o pipeline for consistency. For narrative explanation, we propose a ReACT (Reasoning + Acting) framework with semantic retrieval-based few-shot prompting, ensuring grounded and relevant justifications. To enhance factual accuracy and reduce hallucinations, we incorporate a structured taxonomy table as an auxiliary knowledge base. Our results show that integrating auxiliary knowledge in prompts improves classification accuracy and justification reliability, with applications in media analysis, education, and intelligence gathering.

Our team focused on Subtask 2 (narrative classification) and tried several conceptually straightforward approaches: (1) prompt engineering of LLMs, (2) a zero-shot approach based on sentence similarities, (3) direct classification of fine-grained labels using SetFit, (4) fine-tuning encoder models on fine-grained labels, and (5) hierarchical classification using encoder models with two different classification heads. We list results for all systems on the development set, which show that the best approach was to fine-tune a pre-trained multilingual model, XLM-RoBERTa, with two additional linear layers and a softmax as classification head.

pdf bib abs
Aestar at SemEval-2025 Task 8: Agentic LLMs for Question Answering over Tabular Data
Rishit Tyagi | Mohit Gupta | Rahul Bouri

Question Answering over Tabular Data (Table QA) presents unique challenges due to the diverse structure, size, and data types of real-world tables. The SemEval 2025 Task 8 (DataBench) introduced a benchmark composed of large-scale, domain-diverse datasets to evaluate the ability of models to accurately answer structured queries. We propose a Natural Language to SQL (NL-to-SQL) approach leveraging large language models (LLMs) such as GPT-4o, GPT-4o-mini, and DeepSeek v2:16b to generate SQL queries dynamically. Our system follows a multi-stage pipeline involving example selection, SQL query generation, answer extraction, verification, and iterative refinement. Experiments demonstrate the effectiveness of our approach, achieving 70.5% accuracy on DataBench QA and 71.6% on DataBench Lite QA, significantly surpassing baseline scores of 26% and 27% respectively. This paper details our methodology, experimental results, and alternative approaches, providing insights into the strengths and limitations of LLM-driven Table QA.

pdf bib abs
HiTZ-Ixa at SemEval-2025 Task 1: Multimodal Idiomatic Language Understanding
Anar Yeginbergen | Elisa Sanchez-Bayona | Andrea Jaunarena | Ander Salaberria

In this paper, we present our approach to the AdMIRe (Advancing Multimodal Idiomaticity Representation) shared task, outlining the methodologies and strategies employed to tackle the challenges of idiomatic expressions in multimodal contexts. We discuss both successful and unsuccessful approaches, including the use of models of varying sizes and experiments involving zero- and few-shot learning. Our final submission, based on a zero-shot instruction-following vision-and-language model (VLM), achieved 13th place for the English test set and 1st place for the Portuguese test set on the preliminary leaderboard. We investigate the performance of open VLMs in this task, demonstrating that both large language models (LLMs) and VLMs exhibit strong capabilities in identifying idiomatic expressions. However, we also identify significant limitations in both model types, including instability and a tendency to generate hallucinated content, which raises concerns about their reliability in interpreting figurative language. Our findings emphasize the need for further advancements in multimodal models to improve their robustness and mitigate these issues.

pdf bib abs
KyuHyunChoi at SemEval-2025 Task 10: Narrative Extraction Using a Summarization-Specific Pretrained Model
Kyu Hyun Choi | Seung Hoon Na

Task 10 of SemEval 2025 was proposed to develop supporting information for analyzing the risks of misinformation and propaganda in news articles. In this study, we selected Sub-task 3, which involves generating evidence explaining why a particular dominant narrative is labeled in an article, and fine-tuned PEGASUS for this purpose, achieving the best performance in the competition.

pdf bib abs
Shouth NLP at SemEval-2025 Task 7: Multilingual Fact-Checking Retrieval Using Contrastive Learning
Juan Pérez | Santiago Lares

We present a multilingual fact-checking retrieval system for the SemEval-2025 task of matching social media posts with relevant fact checks. Our approach utilizes a contrastive learning framework built on the multilingual E5 model architecture, fine-tuned on the provided dataset. The system achieves a Success@10 score of 0.867 on the official test set, with performance variations between languages. We demonstrate that input prefixes and language-specific corpus filtering significantly improve retrieval performance. Our analysis reveals interesting patterns in cross-lingual transfer, with specifically strong results on Malaysian and Thai languages. We make our code public for further research and development.
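For readers unfamiliar with the Success@10 metric reported above, here is a minimal sketch. It assumes the common definition in which a post counts as a success if at least one relevant fact check appears among its top-k retrieved results; the function name and toy IDs are illustrative and not taken from the paper or the task's scorer.

```python
from typing import Dict, List, Set


def success_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Fraction of queries with at least one relevant item in the top-k results."""
    if not ranked:
        return 0.0
    hits = 0
    for query_id, candidates in ranked.items():
        gold = relevant.get(query_id, set())
        # A query is a hit as soon as any of its top-k candidates is relevant.
        if any(c in gold for c in candidates[:k]):
            hits += 1
    return hits / len(ranked)


# Toy data (hypothetical IDs): two posts, each with a ranked list of fact checks.
ranked = {"post1": ["fc3", "fc7", "fc1"], "post2": ["fc9", "fc2", "fc4"]}
relevant = {"post1": {"fc1"}, "post2": {"fc5"}}
score = success_at_k(ranked, relevant, k=3)
```

Under this definition the metric rewards finding any one matching fact check, so it does not distinguish a relevant result at rank 1 from one at rank k.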

pdf bib abs
AIMA at SemEval-2025 Task 1: Bridging Text and Image for Idiomatic Knowledge Extraction via Mixture of Experts
Arash Rasouli | Erfan Sadraiye | Omid Ghahroodi | Hamid Rabiee | Ehsaneddin Asgari

Idioms are integral components of language, playing a crucial role in understanding and processing linguistic expressions. Although extensive research has been conducted on the comprehension of idioms in the text domain, their interpretation in multi-modal spaces remains largely unexplored. In this work, we propose a multi-expert framework to investigate the transfer of idiomatic knowledge from the language to the vision modality. Through a series of experiments, we demonstrate that leveraging text-based representations of idioms can significantly enhance understanding of the visual space, bridging the gap between linguistic and visual semantics.

pdf bib abs
YNU-HPCC at SemEval-2025 Task 8: Enhancing Question-Answering over Tabular Data with TableGPT2
Kaiwen Hu | Jin Wang | Xuejie Zhang

This paper describes our systems for SemEval 2025 Task 8, Question Answering over Tabular Data. This task encourages us to develop a system that answers questions of the kind present in DataBench over day-to-day datasets, where the answer is either a number, a categorical value, a boolean value, or lists of several types. Participating in Task 8, we engage in all subtasks. The challenge lies in the multi-step reasoning process of converting natural language queries into executable code. This challenge is exacerbated by the limitations of current methods, such as chaining reasoning, which have difficulty handling complex multi-step reasoning paths due to difficulty evaluating intermediate steps. In the official ranking, we obtain a score of 65.64. On the final competition test set, our DataBench accuracy is 65.64%, and DataBench Lite accuracy is 66.62%. Both exceed the baseline (26%). The competitive results in two subtasks demonstrate the effectiveness of our systems.

pdf bib abs
QiMP at SemEval-2025 Task 11: Optimizing Text-based Emotion Classification in English Beyond Traditional Methods
Mariia Bogatyreva | Pascal Gaertner | Quim Ribas | Daryna Dementieva | Alexander Fraser

As human-machine interactions become increasingly natural through text, accurate emotion recognition is essential. Detecting emotions provides valuable insights across various applications. In this paper, we present our approach for SemEval-2025 Task 11, Track A, which focuses on multi-label text-based detection of perceived emotions. Our system was designed for and tested on English language text. To classify emotions present in text snippets, we initially experimented with traditional techniques such as Logistic Regression, Gradient Boosting, and SVM. We then explored state-of-the-art LLMs (OpenAI o1 and DeepSeek V3) before developing our final system, a fine-tuned Transformer-based model. Our best-performing approach employs an ensemble of fine-tuned DeBERTa-large instances with multiple seeds, optimized using Optuna and StratifiedKFold cross-validation. This approach achieves an F1-score of 0.75, demonstrating promising results with room for further improvement.

pdf bib abs
TIFIN India at SemEval-2025: Harnessing Translation to Overcome Multilingual IR Challenges in Fact-Checked Claim Retrieval
Prasanna Devadiga | Arya Suneesh | Pawan Rajpoot | Bharatdeep Hazarika | Aditya Baliga

We address the challenge of retrieving previously fact-checked claims in monolingual and cross-lingual settings, a critical task given the global prevalence of disinformation. Our approach follows a two-stage strategy: a reliable baseline retrieval system using a fine-tuned embedding model and an LLM-based reranker. Our key contribution is demonstrating how LLM-based translation can overcome the hurdles of multilingual information retrieval. Additionally, we focus on ensuring that the bulk of the pipeline can be replicated on a consumer GPU. Our final integrated system achieved a success@10 score of 0.938 (~0.94) and 0.81025 on the monolingual and crosslingual test sets respectively.

This paper investigates the impact of data quality and processing strategies on emotion recognition in Brazilian Portuguese (PTBR) texts. We focus on data distribution, linguistic context, and augmentation techniques such as translation and synthetic data generation. To evaluate these aspects, we conduct experiments on the PTBR portion of the BRIGHTER dataset, a manually curated multilingual dataset containing nearly 100,000 samples, of which 4,552 are in PTBR. Our study encompasses both multi-label emotion detection (presence/absence classification) and emotion intensity prediction (0 to 3 scale), following the SemEval 2025 Task 11 setup. Results demonstrate that emotion intensity labels enhance model performance after discretization, and that smaller multilingual models can outperform larger ones in low-resource settings. Our official submission ranked 6th, but further refinements improved our ranking to 3rd, trailing the top submission by only 0.047, reinforcing the significance of a data-centric approach in emotion recognition.

pdf bib abs
Transformer25 at SemEval-2025 Task 1: A similarity-based approach
Wiebke Petersen | Lara Eulenpesch | Ann Piho | Julio Julio | Victoria Lohner

Accurately representing non-compositional language, such as idiomatic expressions, is essential to avoid misinterpretations that could affect subsequent tasks. This paper presents the submission of Transformer25 to the SemEval 2025 task on advancing the representation of multimodal idiomaticity. This challenge involves matching idiomatic expressions with corresponding image descriptions that depict their meanings. Our system utilizes a BERT-based pre-trained sentence embedding model, ChatGPT-generated definitions, and preprocessing. Our final submission ranked 7th out of 9 for Subtask A. The paper provides a system description and analysis of our model, including minimal visualizations.

pdf bib abs
HITSZ-HLT at SemEval-2025 Task 8: Multi-turn Interactive Code Generation for Question Answering on Tabular Data
Jun Wang | Feng Xiong | Hongling Xu | Geng Tu | Ruifeng Xu

This paper introduces the system developed by the HITSZ-HLT team for SemEval-2025 Task 8: DataBench, Question-Answering over Tabular Data. The primary objective of Table Question Answering (TableQA) is to provide accurate answers to user queries by interpreting and understanding tabular data. To address this, we propose the Multi-turn Interactive Code GeneratiOn (MICO) framework. Specifically, MICO employs code generation as a proxy task for TableQA and integrates feedback from the execution of the generated code via a multi-turn dialogue process, thereby guiding the model towards self-correction. Experimental results demonstrate the effectiveness of our framework, achieving notable performance with a rank of 4/38 on the DataBench and 5/38 on the DataBench lite.
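The execution-feedback idea described above can be sketched as a small loop. This is only an illustration under stated assumptions, not the MICO implementation: `fake_model` is a hypothetical stand-in for an LLM, the table is a list of dictionaries, and generated code is expected to assign its result to a variable named `answer`.

```python
import traceback
from typing import Callable, List, Optional, Tuple


def run_with_feedback(
    generate: Callable[[str, List[str]], str],
    question: str,
    table: List[dict],
    max_turns: int = 3,
) -> Tuple[Optional[object], List[str]]:
    """Ask `generate` for code, execute it, and feed errors back until it runs."""
    feedback: List[str] = []
    for _ in range(max_turns):
        code = generate(question, feedback)
        scope = {"table": table}
        try:
            # NOTE: exec on model output is unsafe outside a sandbox; used here for brevity.
            exec(code, scope)
            return scope["answer"], feedback
        except Exception:
            # Keep only the final traceback line as the feedback message for the next turn.
            feedback.append(traceback.format_exc().strip().splitlines()[-1])
    return None, feedback


def fake_model(question: str, feedback: List[str]) -> str:
    """Hypothetical stand-in for an LLM: the first attempt uses a wrong column name."""
    if not feedback:
        return "answer = sum(row['prices'] for row in table)"
    return "answer = sum(row['price'] for row in table)"


table = [{"price": 2}, {"price": 3}]
answer, feedback = run_with_feedback(fake_model, "What is the total price?", table)
```

In this toy run the first attempt fails with a `KeyError`, the error message is passed back, and the second attempt succeeds; a real system would include the feedback in the next prompt instead of branching on it.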

pdf bib abs
MultiMind at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval via Multi-Source Alignment
Mohammad Mahdi Abootorabi | Alireza Ghahramani Kure | Mohammadali Mohammadkhani | Sina Elahimanesh | Mohammad Ali Ali Panah

This paper presents our system for SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval. In an era where misinformation spreads rapidly, effective fact-checking is increasingly critical. We introduce TriAligner, a novel approach that leverages a dual-encoder architecture with contrastive learning and incorporates both native and English translations across different modalities. Our method effectively retrieves claims across multiple languages by learning the relative importance of different sources in alignment. To enhance robustness, we employ efficient data preprocessing and augmentation using large language models while incorporating hard negative sampling to improve representation learning. We evaluate our approach on monolingual and crosslingual benchmarks, demonstrating significant improvements in retrieval accuracy and fact-checking performance over baselines.

Unlearning is a critical capability for ensuring privacy, security, and compliance in AI systems, enabling models to forget specific data while retaining overall performance. In this work, we participated in Task 4 of SemEval 2025, which focused on unlearning across three sub-tasks: (1) long-form synthetic creative documents, (2) short-form synthetic biographies containing personally identifiable information, and (3) real documents sampled from the target model’s training dataset. We conducted four experiments, employing Supervised Fine-Tuning (SFT) and Negative Preference Optimization (NPO). Despite achieving good performance on the retain set—data that the model was supposed to remember—our findings demonstrate that these techniques did not perform well on the forget set, where unlearning was required.

pdf bib abs
RAGthoven at SemEval 2025 - Task 2: Enhancing Entity-Aware Machine Translation with Large Language Models, Retrieval Augmented Generation and Function Calling
Demetris Skottis | Gregor Karetka | Marek Suppa

This paper presents a system for SemEval 2025 Task 2 on entity-aware machine translation, integrating GPT-4o with Wikidata-based translations, retrieval augmented generation (RAG), and function calling. Implemented in RAGthoven, a lightweight yet powerful toolkit, our approach enriches source sentences with real-time external knowledge to address challenging or culturally specific named entities. Experiments on English-to-ten target languages show notable gains in translation quality, illustrating how LLM-based translation pipelines can leverage knowledge sources with minimal overhead. Its simplicity makes it a strong baseline for future research in entity-focused machine translation.

pdf bib abs
fact check AI at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-checked Claim Retrieval
Pranshu Rastogi

The SemEval-2025 Task 7 on Multilingual and Crosslingual Fact-checked Claim Retrieval focuses on retrieving relevant Fact-checked claims for social media Posts across multiple languages. This task is particularly challenging due to linguistic barriers and the vast number of languages Fact-checkers must consider. In this work, I approach the problem as a Learning-to-Rank task and solve it using a bi-encoder-based model, fine-tuned on a pre-trained transformer optimized for sentence similarity. For the monolingual task, training was performed in both the source languages and their English translations. For cross-lingual retrieval, the training relied on English translations. Most fine-tuned models have fewer than 500M parameters, and the training was carried out efficiently using Kaggle T4 GPUs with parallelization. Despite this lightweight setup, our approach achieved 92% Success@10 for multilingual retrieval and 80% Success@10 for cross-lingual retrieval, securing 5th place in the cross-lingual track and 10th place in the multilingual setting.

pdf bib abs
AlphaPro at SemEval-2025 Task 8: A Code Generation Approach for Question-Answering over Tabular Data
Anshuman Aryan | Laukik Wadhwa | Kalki Eshwar | Aakarsh Sinha | Durgesh Kumar

This work outlines the AlphaPro team’s solution to SemEval-2025 Task 8: Question Answering on Tabular Data. Our system utilizes a three-stage pipeline that uses natural language questions along with the table’s structural information to generate executable Python code, which is subsequently used to query the table and produce answers. The method achieves up to 67% accuracy in task data, demonstrating the feasibility of code generation for tabular question answering. The strengths and limitations of the approach are outlined and suggestions for further research are provided. The code has been made available in a public code repository to promote reproducibility and research in this area.

pdf bib abs
SHA256 at SemEval-2025 Task 4: Selective Amnesia – Constrained Unlearning for Large Language Models via Knowledge Isolation
Saransh Agrawal | Kuan-Hao Huang

Large language models (LLMs) frequently memorize sensitive information during training, posing risks when deploying publicly accessible models. Current machine unlearning methods struggle to selectively remove specific data associations without degrading overall model capabilities. This paper presents our solution to SemEval-2025 Task 4 on targeted unlearning, which introduces a two-stage methodology that combines causal mediation analysis with layer-specific optimization. Through systematic causal tracing experiments on OLMo architectures (1B and 7B parameters), we identify the critical role of the first few transformer layers (layers 0-5) in storing subject-attribute associations within MLP modules. Building on this insight, we develop a constrained optimization approach that freezes upper layers while applying a novel joint loss function to lower layers—simultaneously maximizing forget set loss via output token cross-entropy penalties and minimizing retain set deviation through adaptive regularization. Our method achieves 2nd place in the 1B model track, demonstrating strong task performance while maintaining 88% of baseline MMLU accuracy. These results establish causal-informed layer optimization as a promising paradigm for efficient, precise unlearning in LLMs, offering a significant step forward in addressing data privacy concerns in AI systems.

Translating knowledge-intensive and entity-rich text between English and Korean requires transcreation to preserve language-specific and cultural nuances beyond literal, phonetic or word-for-word conversion. We evaluate 13 models (LLMs and MT systems) using automatic metrics and human assessment by bilingual annotators. Our findings show LLMs outperform traditional MT systems but struggle with entity translation requiring cultural adaptation. By constructing an error taxonomy, we identify incorrect responses and entity name errors as key issues, with performance varying by entity type and popularity level. This work exposes gaps in automatic evaluation metrics, and we hope it enables future work on culturally-nuanced machine translation.

pdf bib abs
silp_nlp at SemEval-2025 Task 2: An Effect of Entity Awareness in Machine Translation Using LLM
Sumit Singh | Pankaj Goyal | Uma Tiwary

In this study, we investigated the effect of entity awareness on machine translation (MT) using large language models (LLMs). Our approach utilized GPT-4o and NLLB-200, integrating named entity recognition (NER) to improve translation quality. The results indicated that incorporating entity information enhanced translation accuracy, especially when dealing with named entities. However, performance was highly dependent on the effectiveness of the NER model.

pdf bib abs
clujteam at SemEval-2025 Task 10: Finetuning SmolLM2 with Taxonomy-based Prompting for Explaining the Dominant Narrative in Propaganda Text
Anca Marginean

XAI has been a long-standing goal of AI. Explaining why a text can be considered to have a dominant narrative, where the narrative is known, is of great importance for dealing with propaganda in news. This paper reports on the participation of the system clujteam in Subtask 3 of Task 10 of SemEval 2025. The system obtained 7th place with a macro F1 of 0.72464, 0.026 behind 1st place. The key components of the solution are the given taxonomy and supervised fine-tuning of SmolLM2.

pdf bib abs
Homa at SemEval-2025 Task 5: Aligning Librarian Records with OntoAligner for Subject Tagging
Hadi Bayrami Asl Tekanlou | Jafar Razmara | Mahsa Sanaei | Mostafa Rahgouy | Hamed Babaei Giglou

This paper presents our system, Homa, for SemEval-2025 Task 5: Subject Tagging, which focuses on automatically assigning subject labels to technical records from TIBKAT using the Gemeinsame Normdatei (GND) taxonomy. We leverage OntoAligner, a modular ontology alignment toolkit, to address this task by integrating retrieval-augmented generation (RAG) techniques. Our approach formulates the subject tagging problem as an alignment task, where records are matched to GND categories based on semantic similarity. We evaluate OntoAligner’s adaptability for subject indexing and analyze its effectiveness in handling multilingual records. Experimental results demonstrate the strengths and limitations of this method, highlighting the potential of alignment techniques for improving subject tagging in digital libraries.

pdf bib abs
Jim at SemEval-2025 Task 5: Multilingual BERT Ensemble
Jim Hahn

The SemEval-2025 Task 5 calls for the utilization of LLM capabilities to apply controlled subject labels to record descriptions in the multilingual library collection of the German National Library of Science and Technology. The multilingual BERT ensemble system described herein produces subject labels for various record types, including articles, books, conference papers, reports, and theses. Results indicate that for English language article records, bidirectional encoder-only LLMs can achieve high recall in automated subject assignment.

pdf bib abs
LA²I²F at SemEval-2025 Task 5: Reasoning in Embedding Space – Fusing Analogical and Ontology-based Reasoning for Document Subject Tagging
Andrea Salfinger | Luca Zaccagna | Francesca Incitti | Gianluca De Nardi | Lorenzo Dal Fabbro | Lauro Snidaro

The LLMs4Subjects shared task invited system contributions that leverage a technical library’s tagged document corpus to learn document subject tagging, i.e., proposing adequate subjects given a document’s title and abstract. To address the imbalance of this training corpus, team LA²I²F devised a semantic retrieval-based system fusing the results of ontological and analogical reasoning in embedding vector space. Our results outperformed a naive baseline of prompting a llama 3.1-based model, whilst being computationally more efficient and competitive with the state of the art.

pdf bib abs
Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs
Osma Suominen | Juho Inkinen | Mona Lehtinen

This paper presents the Annif system in SemEval-2025 Task 5 (LLMs4Subjects), which focussed on subject indexing using large language models (LLMs). The task required creating subject predictions for bibliographic records from the bilingual TIBKAT database using the GND subject vocabulary. Our approach combines traditional natural language processing and machine learning techniques implemented in the Annif toolkit with innovative LLM-based methods for translation and synthetic data generation, and merging predictions from monolingual models. The system ranked first in the all-subjects category and second in the tib-core-subjects category in the quantitative evaluation, and fourth in qualitative evaluations. These findings demonstrate the potential of combining traditional XMTC algorithms with modern LLM techniques to improve the accuracy and efficiency of subject indexing in multilingual contexts.

pdf bib abs
Last Minute at SemEval-2025 Task 5: RAG System for Subject Tagging
Zahra Sarlak | Ebrahim Ansari

In this study, we explore the LLMs4Subjects shared task, which focuses on leveraging retrieval-augmented generation (RAG) to enhance subject classification in technical records from the Leibniz University’s Technical Library (TIBKAT). The challenge requires participants to recommend appropriate subject headings from the GND taxonomy while processing bibliographic data in both German and English.

This paper presents MaRSI, an automatic subject indexing method designed to address the limitations of traditional manual indexing and emerging GenAI technologies. Focusing on improving indexing accuracy in cross-lingual contexts and balancing efficiency and accuracy in large-scale datasets, MaRSI mimics human reference learning behavior by constructing semantic indexes from pre-indexed documents. It calculates similarity to retrieve relevant references, then merges and reorders their topics to generate index results. Experiments demonstrate that MaRSI outperforms supervised fine-tuning of LLMs on the same dataset, offering advantages in speed, effectiveness, and interpretability.

pdf bib abs
YNU-HPCC at SemEval-2025 Task 5: Contrastive Learning for GND Subject Tagging with Multilingual Sentence-BERT
Hong Jiang | Jin Wang | Xuejie Zhang

This paper describes the YNU-HPCC (alias JH) team’s participation in Sub-task 2 of SemEval-2025 Task 5, which requires fine-tuning language models to align subject tags with the TIBKAT collection. The task presents three key challenges: cross-disciplinary document coverage, bilingual (English-German) processing requirements, and extreme classification over 200,000 GND Subjects. To address these challenges, we apply a contrastive learning framework using multilingual Sentence-BERT models, implementing two innovative training strategies: mixed-negative multi-label sampling, and single-label sampling with random negative selection. Our best-performing model achieves significant improvements of 28.6% in average recall, reaching 0.2252 on the core-test set and 0.1677 on the all-test set. Notably, we reveal model architecture-dependent response patterns: MiniLM-series models benefit from multi-label training (+33.5% zero-shot recall), while mpnet variants excel with single-label approaches (+230.3% zero-shot recall). The study further demonstrates the effectiveness of contrastive learning for multilingual semantic alignment in low-resource scenarios, providing insights for extreme classification tasks.

pdf bib abs
TartuNLP at SemEval-2025 Task 5: Subject Tagging as Two-Stage Information Retrieval
Aleksei Dorkin | Kairit Sirts

We present our submission to Task 5 of SemEval-2025. We frame the task as an information retrieval problem, where the document content is used to retrieve subject tags from a large subject taxonomy. We leverage two types of encoder models to build a two-stage information retrieval system: a bi-encoder for coarse-grained candidate extraction at the first stage, and a cross-encoder for fine-grained re-ranking at the second stage.
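The two-stage structure described above can be sketched with toy scoring functions. This is only an illustration of the control flow, not the authors' system: `embed` and `pair_score` are deliberately simple, hypothetical stand-ins for a neural bi-encoder and cross-encoder, and the subject list is invented.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Stand-in for a bi-encoder: encodes each text independently as a bag of words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pair_score(document: str, subject: str) -> float:
    """Stand-in for a cross-encoder: scores the (document, subject) pair jointly."""
    doc_tokens = set(document.lower().split())
    subj_tokens = set(subject.lower().split())
    # Fraction of the subject's tokens that also occur in the document.
    return len(doc_tokens & subj_tokens) / len(subj_tokens) if subj_tokens else 0.0


def two_stage_retrieve(
    document: str, subjects: List[str], top_n: int = 3, top_k: int = 1
) -> List[Tuple[str, float]]:
    # Stage 1: cheap candidate extraction over the whole taxonomy.
    doc_vec = embed(document)
    candidates = sorted(subjects, key=lambda s: cosine(doc_vec, embed(s)), reverse=True)[:top_n]
    # Stage 2: more expensive re-ranking applied to the shortlist only.
    reranked = sorted(((s, pair_score(document, s)) for s in candidates), key=lambda x: x[1], reverse=True)
    return reranked[:top_k]


# Hypothetical subject taxonomy and document title.
subjects = ["machine learning", "organic chemistry", "learning theory", "medieval history"]
result = two_stage_retrieve("a survey of machine learning methods", subjects)
```

The design point is cost: subject embeddings in the first stage can be precomputed once, while the pairwise scorer in the second stage runs only on `top_n` candidates per document.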

pdf bib abs
silp_nlp at SemEval-2025 Task 5: Subject Recommendation With Sentence Transformer
Sumit Singh | Pankaj Goyal | Uma Tiwary

This work explored subject recommendation using sentence transformers within the SemEval-2025 Task 5 (LLMs4Subjects) challenge. Our approach leveraged embedding-based cosine similarity and hierarchical clustering to predict relevant GND subjects for TIB technical records in English and German. By experimenting with different models, including JinaAi, Distiluse-base-multilingual, and TF-IDF, we found that the JinaAi sentence transformer consistently outperformed other methods in terms of precision, recall, and F1-score. Our results highlight the effectiveness of transformer-based embeddings in semantic similarity tasks for subject classification. Additionally, hierarchical clustering helped reduce computational complexity by narrowing down candidate subjects efficiently. Despite the improvements, future work can focus on fine-tuning domain-specific embeddings, exploring knowledge graph integration, and enhancing multilingual capabilities for better generalization.

While extensive research exists on misinformation and disinformation, there is limited focus on future-oriented commitments, such as corporate ESG promises, which are often difficult to verify yet significantly impact public trust and market stability. To address this gap, we introduce the task of promise verification, leveraging natural language processing (NLP) techniques to automatically detect ESG commitments, identify supporting evidence, and evaluate the consistency between promises and evidence, while also inferring potential verification time points. This paper presents the dataset used in SemEval-2025 PromiseEval, outlines participant solutions, and discusses key findings. The goal is to enhance transparency in corporate discourse, strengthen investor trust, and support regulators in monitoring the fulfillment of corporate commitments.

We present the Mu-SHROOM shared task which is focused on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs). Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies. The very high number of submissions highlights the interest of the community in hallucination detection. We present the results of the participating systems and provide an empirical analysis in order to better understand the factors that can lead to strong performance in this task. We also underscore current challenges, notably the varying degree of hallucinations across languages and the high annotator disagreement when labeling hallucination spans.

The rapid spread of online disinformation presents a global challenge, and machine learning has been widely explored as a potential solution. However, multilingual settings and low-resource languages are often neglected in this field. To address this gap, we conducted a shared task on multilingual claim retrieval at SemEval 2025, aimed at identifying fact-checked claims that match newly encountered claims expressed in social media posts across different languages. The task includes two subtracks: 1) a monolingual track, where social posts and claims are in the same language, and 2) a crosslingual track, where social posts and claims might be in different languages. A total of 179 participants registered for the task, contributing to 52 test submissions. 23 out of 31 teams have submitted their system papers. In this paper, we report the best-performing systems as well as the most common and the most effective approaches across both subtracks. This shared task, along with its dataset and participating systems, provides valuable insights into multilingual claim retrieval and automated fact-checking, supporting future research in this field.

pdf bib abs
SemEval-2025 Task 8: Question Answering over Tabular Data
Jorge Osés Grijalba | L. Alfonso Ureña-López | Eugenio Martínez Cámara | Jose Camacho-Collados

We introduce the findings and results of SemEval-2025 Task 8: Question Answering over Tabular Data. We featured two subtasks, DataBench and DataBench Lite. DataBench consists of question answering over tabular data, while DataBench Lite comprises smaller datasets that might be easier for current models to manage, for example by fitting them into a prompt. The task was open to any approach, but answers had to conform to a required typing format. In this paper we present the task, analyze a number of system submissions, and discuss the results. The results show that approaches leveraging LLMs dominated the task, with larger models performing considerably better than small models.

pdf bib abs
SemEval-2025 Task 9: The Food Hazard Detection Challenge
Korbinian Randl | John Pavlopoulos | Aron Henriksson | Tony Lindgren | Juli Bakagianni

In this challenge, we explored text-based food hazard prediction with long tail distributed classes. The task was divided into two subtasks: (1) predicting whether a web text implies one of ten food-hazard categories and identifying the associated food category, and (2) providing a more fine-grained classification by assigning a specific label to both the hazard and the product. Our findings highlight that large language model-generated synthetic data can be highly effective for oversampling long-tail distributions. Furthermore, we find that fine-tuned encoder-only, encoder-decoder, and decoder-only systems achieve comparable maximum performance across both subtasks. During this challenge, we are gradually releasing (under CC BY-NC-SA 4.0) a novel set of 6,644 manually labeled food-incident reports.

pdf bib abs
SemEval-2025 Task 2: Entity-Aware Machine Translation
Simone Conia | Min Li | Roberto Navigli | Saloni Potdar

Translating text that contains complex or challenging named entities (e.g., culture-specific book and movie titles, location names, proper nouns, and food names) remains a difficult task for modern machine translation systems, including the latest large language models. To systematically study and advance progress in this area, we organized Entity-Aware Machine Translation, or EA-MT, a shared task that evaluates how well systems handle entity translation across 10 language pairs. With EA-MT, we introduce XC-Translate, a novel gold benchmark comprising over 50K manually translated sentences with entity names that can deviate significantly from word-to-word translations in their target languages. This paper describes the creation process of XC-Translate, provides an overview of the approaches explored by our participants, presents the main evaluation findings, and points toward open research directions, such as contextual retrieval methods for low-resource entities and more robust evaluation metrics for entity correctness. We hope that our shared task will inspire further research in entity-aware machine translation and foster the development of more culturally accurate translation systems.

We present our shared task on text-based emotion detection, covering more than 30 languages from seven distinct language families. These languages are predominantly low-resource and spoken across various continents. The data instances are multi-labeled into six emotional classes, with additional datasets in 11 languages annotated for emotion intensity. Participants were asked to predict labels in three tracks: (a) emotion labels in monolingual settings, (b) emotion intensity scores, and (c) emotion labels in cross-lingual settings. The task attracted over 700 participants. We received final submissions from more than 200 teams and 93 system description papers. We report baseline results, as well as findings on the best-performing systems, the most common approaches, and the most effective methods across various tracks and languages. The datasets for this task are publicly available at https://github.com/emotion-analysis-project/SemEval2025-task11.

pdf bib abs
SemEval-2025 Task 5: LLMs4Subjects - LLM-based Automated Subject Tagging for a National Technical Library’s Open-Access Catalog
Jennifer D’souza | Sameer Sadruddin | Holger Israel | Mathias Begoin | Diana Slawig

We present SemEval-2025 Task 5: LLMs4Subjects, a shared task on automated subject tagging for scientific and technical records in English and German using the GND taxonomy. Participants developed LLM-based systems to recommend top-k subjects, evaluated through quantitative metrics (precision, recall, F1-score) and qualitative assessments by subject specialists. Results highlight the effectiveness of LLM ensembles, synthetic data generation, and multilingual processing, offering insights into LLMs for digital library classification.

We introduce SemEval-2025 Task 4: unlearning sensitive content from Large Language Models (LLMs). The task features 3 subtasks for LLM unlearning spanning different use cases: (1) unlearn long form synthetic creative documents spanning different genres; (2) unlearn short form synthetic biographies containing personally identifiable information (PII), including fake names, phone number, SSN, email and home addresses, and (3) unlearn real documents sampled from the target model’s training dataset. We received over 100 submissions from over 30 institutions and we summarize the key techniques and lessons in this paper.

Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in Large Language Models (LLMs), idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and tasks for SemEval-2025 Task 1: AdMiRe (Advancing Multimodal Idiomaticity Representation), which challenges the community to assess and improve models’ ability to interpret idiomatic expressions in multimodal contexts and in multiple languages. Participants competed in two subtasks: ranking images based on their alignment with idiomatic or literal meanings, and predicting the next image in a sequence. The most effective methods achieved human-level performance by leveraging pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the weaknesses in these models’ representations of idiomaticity.

We introduce SemEval-2025 Task 10 on Multilingual Characterization and Extraction of Narratives from Online News, which focuses on the identification and analysis of narratives in online news media. The task is structured into three subtasks: (1) Entity Framing, to identify the roles that relevant entities play within narratives, (2) Narrative Classification, to assign documents fine-grained narratives according to a given, topic-specific taxonomy of narrative labels, and (3) Narrative Extraction, to provide a justification for the dominant narrative of the document. To this end, we analyze news articles across two critical domains, Ukraine-Russia War and Climate Change, in five languages: Bulgarian, English, Hindi, Portuguese, and Russian. This task introduces a novel multilingual and multifaceted framework for studying how online news media construct and disseminate manipulative narratives. By addressing these challenges, our work contributes to the broader effort of detecting, understanding, and mitigating the spread of propaganda and disinformation. The task attracted a lot of interest: 310 teams registered, with 66 submitting official results on the test set.