VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment

Md. Mahfuzur Rahman, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Kishor Datta Gupta, Roy George


Abstract
General-purpose vision-language models (VLMs) such as LLaVA and QwenVL produce descriptions of disaster imagery that lack domain-specific vocabulary and actionable detail. We propose the Vision-Language Caption Enhancer (), a framework that integrates external semantic knowledge from ConceptNet and WordNet into the caption generation process for post-disaster satellite and UAV imagery. operates in two stages: first, a baseline VLM generates an initial caption conditioned on YOLOv8 object detections; second, a knowledge-enriched sequential model, a CNN-LSTM or a hierarchical cross-modal Transformer, refines the caption using a vocabulary augmented with 1,566 domain-relevant terms extracted from knowledge graphs. We evaluate on two disaster benchmarks: xBD (satellite, 6,369 images, 3 damage classes) and RescueNet (UAV, 4,494 images, 12 damage classes), using CLIPScore for semantic alignment and InfoMetIC for informativeness. On RescueNet with the Transformer decoder, with knowledge graph enrichment produces captions preferred over QwenVL baselines in 95.33% of image pairs on InfoMetIC and 73.64% on CLIPScore. Qualitative analysis shows that without knowledge graph integration, generated captions exhibit hallucinations, word repetition, and semantic incoherence, whereas knowledge-enriched captions maintain factual consistency and domain-appropriate vocabulary. intended as a continuous, extensible monitor of differential framing under changing real-world inputs.
Anthology ID:
2026.alvr-main.15
Volume:
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Qianqi Yan, Syrielle Montariol, Yue Fan, Jing Gu, Jiayi Pan, Manling Li, Parisa Kordjamshidi, Alane Suhr, Xin Eric Wang
Venues:
ALVR | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
186–198
Language:
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.15/
DOI:
Bibkey:
Cite (ACL):
Md. Mahfuzur Rahman, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Kishor Datta Gupta, and Roy George. 2026. VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment. In Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR), pages 186–198, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment (Rahman et al., ALVR 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.15.pdf