HotelCheckSpan: A Benchmark Dataset for LLM Faithfulness

Patricia Schmidtova, Ondrej Dusek, Saad Mahamood


Abstract
Hallucinations are among the most persistent and challenging issues in large language model (LLM) outputs. This holds particularly in domains that combine objective and subjective content, such as hotel descriptions, which are intended to serve as enticing advertisements. Distinguishing between factual errors and interpretative exaggeration is often subtle, complicating both human and automated evaluation. To address this, we present HotelCheckSpan, the first span-level faithfulness dataset for the hotel domain. Each example aggregates one or more hotel descriptions, and human-annotated summaries are labeled with three error types: Incorrect, Misleading, and Not Checkable. By marking the precise spans where errors occur, the dataset captures fine-grained information about the nature of hallucinations and factual inconsistencies. In addition to human annotations, we collect span-level judgments from multiple LLMs, enabling direct human–model comparisons. Our analysis shows that inter-annotator agreement varies substantially across aggregation levels: example-level agreement can mask subtle span-level disagreements, while soft and hard F1 variants highlight discrepancies in both span placement and error categorization. HotelCheckSpan provides a benchmark for studying ambiguity and disagreement, validating automatic faithfulness metrics, and evaluating LLMs as judges, offering a rich resource for research on faithfulness, subjectivity, and annotation practices in mixed-content domains.
Anthology ID:
2026.lrec-main.782
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
9973–9987
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.782/
Cite (ACL):
Patricia Schmidtova, Ondrej Dusek, and Saad Mahamood. 2026. HotelCheckSpan: A Benchmark Dataset for LLM Faithfulness. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 9973–9987, Palma de Mallorca, Spain. ELRA Language Resource Association.
Cite (Informal):
HotelCheckSpan: A Benchmark Dataset for LLM Faithfulness (Schmidtova et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.782.pdf