Daoan Zhang


2025

pdf bib
How LLMs React to Industrial Spatio-Temporal Data? Assessing Hallucination with a Novel Traffic Incident Benchmark Dataset
Qiang Li | Mingkun Tan | Xun Zhao | Dan Zhang | Daoan Zhang | Shengzhao Lei | Anderson S. Chu | Lujun Li | Porawit Kamnoedboon
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

Large language models (LLMs) hold revolutionary potential to digitize and enhance the Health & Public Services (H&PS) industry. Despite their advanced linguistic abilities, concerns about accuracy, stability, and traceability still persist, especially in high-stakes areas such as transportation systems. Moreover, the predominance of English in LLM development raises questions about how they perform in non-English contexts. This study originated from a real world industrial GenAI application, introduces a novel cross-lingual benchmark dataset comprising nearly 99,869 real traffic incident records from Vienna (2013-2023) to assess the robustness of state-of-the-art LLMs ( 9) in the spatio vs temporal domain for traffic incident classification. We then explored three hypotheses — sentence indexing, date-to-text conversion, and German-to-English translation — and incorporated Retrieval Augmented Generation (RAG) to further examine the LLM hallucinations in both spatial and temporal domain. Our experiments reveal significant performance disparities in the spatio-temporal domain and demonstrate what types of hallucinations that RAG can mitigate and how it achieves this. We also provide open access to our H&PS traffic incident dataset, with the project demo and code available at Website https://sites.google.com/view/llmhallucination/home