Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation?
Evangelia Gogoulou, Shorouq Zahra, Liane Guillou, Luise Dürlich, Joakim Nivre
Abstract
A frequently observed problem with LLMs is their tendency to generate output that is nonsensical, illogical, or factually incorrect, often referred to broadly as “hallucination”. Building on the recently proposed HalluciGen task for hallucination detection and generation, we evaluate a suite of open-access LLMs on their ability to detect intrinsic hallucinations in two conditional generation tasks: translation and paraphrasing. We study how model performance varies across tasks and languages and we investigate the impact of model size, instruction-tuning, and prompt choice. We find that performance varies across models but is consistent across prompts. Finally, we find that NLI models perform comparably well, suggesting that LLM-based detectors are not the only viable option for this specific task.
- Anthology ID: 2025.gem-1.13
- Volume: Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
- Month: July
- Year: 2025
- Address: Vienna, Austria and virtual meeting
- Editors: Kaustubh Dhole, Miruna Clinciu
- Venues: GEM | WS
- Publisher: Association for Computational Linguistics
- Pages: 161–177
- URL: https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.13/
- Cite (ACL): Evangelia Gogoulou, Shorouq Zahra, Liane Guillou, Luise Dürlich, and Joakim Nivre. 2025. Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation?. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 161–177, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
- Cite (Informal): Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation? (Gogoulou et al., GEM 2025)
- PDF: https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.13.pdf
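The abstract notes that NLI models perform comparably to LLM-based detectors on this task. As a rough illustration of that kind of baseline (not the paper's exact setup), the sketch below scores a source–hypothesis pair with a publicly available MNLI-tuned model via Hugging Face Transformers and treats a predicted contradiction as a signal of an intrinsic hallucination; the model checkpoint and decision rule are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of an NLI-based intrinsic-hallucination check.
# Assumptions: the MNLI-tuned checkpoint and the contradiction-based decision
# rule below are illustrative choices, not the configuration used in the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-large-mnli"  # any MNLI-style NLI model would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def flag_hallucination(source: str, hypothesis: str) -> tuple[str, float]:
    """Return the NLI label for (source, hypothesis) and its probability."""
    inputs = tokenizer(source, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    idx = int(probs.argmax())
    return model.config.id2label[idx], float(probs[idx])


# Example: a paraphrase that contradicts its source should come out as CONTRADICTION,
# which we read as a potential intrinsic hallucination.
label, prob = flag_hallucination(
    "The meeting was postponed until next week.",
    "The meeting took place as scheduled.",
)
print(label, round(prob, 3))
```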