Vivek Sourabh
2025
FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs
Forrest Sheng Bao | Miaoran Li | Renyi Qu | Ge Luo | Erana Wan | Yujia Tang | Weisi Fan | Manveer Singh Tamber | Suleman Kazi | Vivek Sourabh | Mike Qi | Ruixuan Tu | Chenyu Xu | Matthew Gonzales | Ofer Mendelevitch | Amin Ahmad
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Summarization is one of the most common tasks performed by large language models (LLMs), especially in applications like Retrieval-Augmented Generation (RAG). However, both existing evaluations of hallucinations in LLM-generated summaries and evaluations of hallucination detection models suffer from a lack of diversity and recency in the LLMs and LLM families considered. This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs from 8 different families, with ground-truth annotations by human experts. "Challenging" here means summaries on which popular, state-of-the-art hallucination detection models, including GPT-4o-as-a-judge, disagreed. Our results show that GPT-4o and GPT-3.5-Turbo produce the fewest hallucinations. However, most state-of-the-art hallucination detection models achieve accuracies near 50% on FaithBench, indicating substantial room for future improvement.
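To make the headline number concrete, below is a minimal Python sketch of how a hallucination detector might be scored against FaithBench-style binary ground-truth labels. The record schema, toy examples, and naive_detector baseline are hypothetical illustrations for exposition only, not the benchmark's actual data format or any released API.

```python
# Minimal sketch of scoring a detector against FaithBench-style ground truth.
# The record schema, toy examples, and naive detector below are illustrative
# assumptions, not the benchmark's actual data or API.

from typing import Callable, Dict, List

# Each record pairs a source passage with an LLM-generated summary and a
# human expert label: True if the summary contains a hallucination.
EXAMPLES: List[Dict] = [
    {"source": "The report was released in March 2024 and ran 40 pages.",
     "summary": "The report came out in March 2024.",
     "hallucinated": False},
    {"source": "The report was released in March 2024 and ran 40 pages.",
     "summary": "The 2023 report was widely praised by critics.",
     "hallucinated": True},
]

def naive_detector(source: str, summary: str) -> bool:
    """Toy baseline: flag the summary if fewer than half of its longer
    words appear in the source. Real detectors (NLI models, LLM judges)
    are far more sophisticated, yet still land near 50% on FaithBench."""
    src_words = set(source.lower().split())
    sum_words = [w for w in summary.lower().split() if len(w) > 3]
    if not sum_words:
        return False
    hits = sum(w in src_words for w in sum_words)
    return hits < len(sum_words) / 2

def accuracy(detector: Callable[[str, str], bool]) -> float:
    """Fraction of records where the detector's verdict matches the
    human ground-truth label; this is the metric on which state-of-the-art
    detectors hover near 0.5 in the paper."""
    correct = sum(
        detector(r["source"], r["summary"]) == r["hallucinated"]
        for r in EXAMPLES
    )
    return correct / len(EXAMPLES)

if __name__ == "__main__":
    print(f"toy accuracy: {accuracy(naive_detector):.2f}")
```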