Erik Varecha
2025
FENJI at SemEval-2025 Task 3: Retrieval-Augmented Generation and Hallucination Span Detection
Flor Alberts | Ivo Bruinier | Nathalie Palm | Justin Paetzelt | Erik Varecha
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
Large Language Models (LLMs) have significantly advanced Natural Language Processing. However, ensuring the factual reliability of these models remains a challenge, as they are prone to hallucination: generating text that appears coherent but contains inaccurate or unsupported information. The SemEval-2025 Mu-SHROOM task focused on character-level hallucination detection in 14 languages. In this task, participants were required to pinpoint hallucinated spans in text generated by multiple instruction-tuned LLMs. Our team created a system that combined a Retrieval-Augmented Generation (RAG) approach with prompting a FLAN-T5 model to identify hallucination spans. Contrary to what prior literature would suggest, our approach yielded disappointing results, underperforming all the “mark-all” baselines and failing to achieve competitive scores. Notably, removing RAG improved performance. The findings highlight that while RAG holds potential for hallucination detection, its effectiveness is heavily influenced by the retrieval component’s context-awareness. Enhancing the RAG pipeline’s ability to capture more comprehensive contextual information could improve performance across languages, making it a more reliable tool for identifying hallucination spans.