StructHallu-Drift: Benchmarking Structured Hallucinations Under Schema Evolution in LLMs

Mujtaba Hasan

Abstract

Large Language Models (LLMs) are increasingly used to generate structured outputs (JSON objects, SQL queries, and structured records) from formal schemas. While recent advances in constrained decoding and schema-aware prompting have improved syntactic compliance, the semantic reliability of these outputs remains poorly characterized. We investigate this gap through the lens of schema drift: the inevitable evolution of database schemas in production environments through column renamings, type changes, and constraint modifications.

We introduce StructHallu-Drift, a benchmark and evaluation framework for studying structured hallucinations under schema evolution. We contribute: (1) a six-category hallucination taxonomy that disentangles syntactic validity from semantic fidelity; (2) a controlled evaluation suite applying realistic schema mutations at three severity levels to established NL-to-structure datasets; and (3) a systematic evaluation of four LLMs spanning 7B to 70B parameters across three structured output tasks.

Experiments on 1,200 schema–model evaluation instances reveal four key findings: (i) 39–54% of structured outputs contain at least one semantic hallucination; (ii) schema drift severity has surprisingly minimal effect on hallucination rates (∼44% across all levels, p = 0.59), suggesting imperfect schema conditioning under our prompting setup; (iii) output format is the dominant factor in generation reliability, with SQL achieving ∼85% semantic validity while schema-grounded record generation drops to 7–24%; and (iv) each model exhibits a distinct hallucination fingerprint, implying that mitigation strategies must be model-specific rather than universal. We publicly release our benchmark and evaluation toolkit.
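To make the notion of schema drift concrete, the following is a minimal Python sketch of two of the mutation types named in the abstract (a column renaming and a type change), together with a simple check for output fields that no longer exist in the evolved schema. It is an illustration only and is not taken from the released benchmark or toolkit: the schema representation, the helper names (`rename_column`, `change_type`, `unknown_fields`), and the example columns are all hypothetical.

```python
from typing import Dict, List

# Hypothetical schema representation: column name -> type name.
Schema = Dict[str, str]


def rename_column(schema: Schema, old: str, new: str) -> Schema:
    """Return a copy of the schema with one column renamed (order preserved)."""
    if old not in schema:
        raise KeyError(old)
    return {(new if name == old else name): typ for name, typ in schema.items()}


def change_type(schema: Schema, column: str, new_type: str) -> Schema:
    """Return a copy of the schema with one column's type changed."""
    if column not in schema:
        raise KeyError(column)
    return {**schema, column: new_type}


def unknown_fields(record: dict, schema: Schema) -> List[str]:
    """List the record keys that do not appear in the schema."""
    return sorted(key for key in record if key not in schema)


# Assumed example schema and a drifted version of it.
original = {"user_id": "int", "signup_date": "date", "email": "text"}
drifted = change_type(
    rename_column(original, "signup_date", "created_at"), "user_id", "text"
)

# A generated record that still uses the pre-drift column name.
stale_output = {"user_id": "42", "signup_date": "2024-01-01", "email": "a@example.com"}

print(unknown_fields(stale_output, original))
print(unknown_fields(stale_output, drifted))
```

A record like `stale_output` is syntactically valid JSON in both cases, which is why a check against the current schema, rather than a syntax check, is needed to surface this kind of error.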

Anthology ID: 2026.surgellm-1.22
Volume: Proceedings of the First Workshop on Structured Understanding, Retrieval, and Generation in the LLM Era (SURGeLLM 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Vivek Gupta, Kaize Ding, Harsha Kokel, Yue Zhao, Amit Agarwal, Yu Wang, Michael Glass, Yu Zhang, Kavitha Srinivas, Xiusi Chen, Oktie Hassanzadeh, Qi Zhu, Shuaichen Chang, Yuan Luo
Venues: SURGeLLM | WS
Publisher: Association for Computational Linguistics
Pages: 333–343
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.surgellm-1.22/
Cite (ACL): Mujtaba Hasan. 2026. StructHallu-Drift: Benchmarking Structured Hallucinations Under Schema Evolution in LLMs. In Proceedings of the First Workshop on Structured Understanding, Retrieval, and Generation in the LLM Era (SURGeLLM 2026), pages 333–343, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): StructHallu-Drift: Benchmarking Structured Hallucinations Under Schema Evolution in LLMs (Hasan, SURGeLLM 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.surgellm-1.22.pdf