Timo Strijbis


2025

HalluRAG-RUG at SemEval-2025 Task 3: Using Retrieval-Augmented Generation for Hallucination Detection in Model Outputs
Silvana Abdi | Mahrokh Hassani | Rosalien Kinds | Timo Strijbis | Roman Terpstra
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Large Language Models (LLMs) suffer from a critical limitation: hallucination, the generation of fluent but factually incorrect text. This paper presents our approach to hallucination detection in English model outputs as part of SemEval-2025 Task 3 (Mu-SHROOM). Our method, HalluRAG-RUG, integrates Retrieval-Augmented Generation (RAG) using Llama-3 with prediction models based on token probabilities and semantic similarity. We retrieved relevant factual information using a named entity recognition (NER)-based Wikipedia search and applied abstractive summarization to refine the knowledge base. The hallucination detection pipeline then used this retrieved knowledge to identify inconsistent spans in model-generated text. These spans were combined with the outputs of two further systems that flag hallucinations based on token probabilities and low-similarity sentences. Our system placed 33rd out of 41, performing slightly below the ‘mark all’ baseline but surpassing the ‘mark none’ and ‘neural’ baselines, with an IoU of 0.3093 and a correlation of 0.0833.
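
The paper itself does not include code; as an illustration of the retrieval step described in the abstract, the sketch below looks up Wikipedia summaries for named entities found in a model output. The library choices (spaCy, the wikipedia package) and function names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of NER-based Wikipedia retrieval, assuming spaCy and the
# `wikipedia` package (illustrative choices, not the authors' actual pipeline).
import spacy
import wikipedia

nlp = spacy.load("en_core_web_sm")

def retrieve_context(model_output: str, sentences_per_entity: int = 3) -> list[str]:
    """Collect short Wikipedia summaries for the named entities in the output."""
    doc = nlp(model_output)
    passages = []
    for ent in doc.ents:
        try:
            # One short summary per entity; these passages form the knowledge
            # base that would later be condensed by abstractive summarization.
            passages.append(wikipedia.summary(ent.text, sentences=sentences_per_entity))
        except wikipedia.exceptions.WikipediaException:
            continue  # ambiguous or missing pages are simply skipped
    return passages
```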
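One of the two supporting systems flags hallucinations from token probabilities. A possible reading, sketched below, marks tokens whose log-probability under a causal language model falls below a threshold; the model choice and threshold value are assumptions, not the authors' settings.

```python
# Illustrative sketch of token-probability-based flagging (model and threshold
# are assumptions for the example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def low_probability_tokens(text: str, threshold: float = -5.0) -> list[str]:
    """Return tokens whose log-probability given their prefix is below the threshold."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    flagged = []
    for pos in range(1, enc.input_ids.shape[1]):
        token_id = enc.input_ids[0, pos]
        # Probability the model assigned to this token at the previous position.
        score = log_probs[0, pos - 1, token_id].item()
        if score < threshold:
            flagged.append(tokenizer.decode(token_id))
    return flagged
```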
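The reported IoU score measures the overlap between predicted and gold hallucinated character spans. A straightforward per-example computation is shown below; the helper name and the half-open span convention are illustrative.

```python
# Character-level intersection-over-union between predicted and gold spans
# (helper name and span convention are illustrative).
def span_iou(predicted: list[tuple[int, int]], gold: list[tuple[int, int]]) -> float:
    """IoU over character offsets; spans are half-open (start, end) pairs."""
    pred_chars = {i for start, end in predicted for i in range(start, end)}
    gold_chars = {i for start, end in gold for i in range(start, end)}
    if not pred_chars and not gold_chars:
        return 1.0  # both empty: perfect agreement
    return len(pred_chars & gold_chars) / len(pred_chars | gold_chars)

# Example: a prediction covering half of a 10-character gold span scores 0.5.
print(span_iou([(0, 5)], [(0, 10)]))  # 0.5
```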