Yong Hui Zhou
2025
FunghiFunghi at SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
Tariq Ballout
|
Pieter Jansma
|
Nander Koops
|
Yong Hui Zhou
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
Large Language Models (LLMs) often generate hallucinated content, which is factually incorrect or misleading, posing reliability challenges. The Mu-SHROOM shared task addresses hallucination detection in multilingualLLM-generated text. This study employsSpanBERT, a transformer model optimized forspan-based predictions, to identify hallucinatedspans across multiple languages. To addresslimited training data, we apply dataset augmentation through translation and synthetic generation. The model is evaluated using Intersection over Union (IoU) for span detectionand Spearman’s correlation for ranking consistency. While the model detects hallucinatedspans with moderate accuracy, it struggles withranking confidence scores. These findings highlight the need for improved probability calibration and multilingual robustness. Future workshould refine ranking methods and explore ensemble models for better performance.