DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management

Zhitong Chen, Kai Yin, Xiangjue Dong, Chengkai Liu, Xiangpeng Li, Bo Li, Junwei Ma, Yiming Xiao, Ali Mostafavi, James Caverlee


Abstract
Accurate question answering (QA) in disaster management requires reasoning over uncertain and conflicting information, a setting poorly captured by existing benchmarks built on clean evidence. We introduce DisastQA, a rigorously verified benchmark of 3,000 expert-annotated questions (2,000 multiple-choice and 1,000 open-ended) spanning eight disaster types. The benchmark is constructed via a human-LLM collaboration pipeline with stratified sampling to ensure balanced coverage. Models are evaluated under varying evidence conditions, from closed-book to noisy evidence integration, enabling separation of internal knowledge from reasoning under imperfect information. For open-ended QA, we propose a human-verified keypoint-based evaluation protocol emphasizing factual completeness over verbosity. Experiments with 20 models reveal substantial divergences from general-purpose leaderboards such as MMLU-Pro. While recent open-weight models approach proprietary systems in clean settings, performance degrades sharply under realistic noise, exposing critical reliability gaps for disaster response. All code, data, and evaluation resources are available at https://github.com/TamuChen18/DisastQA.
Anthology ID:
2026.findings-acl.756
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
15402–15427
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.756/
Cite (ACL):
Zhitong Chen, Kai Yin, Xiangjue Dong, Chengkai Liu, Xiangpeng Li, Bo Li, Junwei Ma, Yiming Xiao, Ali Mostafavi, and James Caverlee. 2026. DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management. In Findings of the Association for Computational Linguistics: ACL 2026, pages 15402–15427, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management (Chen et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.756.pdf
Checklist:
2026.findings-acl.756.checklist.pdf