VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Md. Mahfuzur Rahman, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Kishor Datta Gupta, Roy George
Abstract
General-purpose vision-language models (VLMs) such as LLaVA and QwenVL produce descriptions of disaster imagery that lack domain-specific vocabulary and actionable detail. We propose the Vision-Language Caption Enhancer (VLCE), a framework that integrates external semantic knowledge from ConceptNet and WordNet into the caption generation process for post-disaster satellite and UAV imagery. VLCE operates in two stages: first, a baseline VLM generates an initial caption conditioned on YOLOv8 object detections; second, a knowledge-enriched sequential model (either a CNN-LSTM or a hierarchical cross-modal Transformer) refines the caption using a vocabulary augmented with 1,566 domain-relevant terms extracted from knowledge graphs. We evaluate on two disaster benchmarks: xBD (satellite, 6,369 images, 3 damage classes) and RescueNet (UAV, 4,494 images, 12 damage classes), using CLIPScore for semantic alignment and InfoMetIC for informativeness. On RescueNet with the Transformer decoder, VLCE with knowledge graph enrichment produces captions preferred over QwenVL baselines in 95.33% of image pairs on InfoMetIC and 73.64% on CLIPScore. Qualitative analysis shows that without knowledge graph integration, generated captions exhibit hallucinations, word repetition, and semantic incoherence, whereas knowledge-enriched captions maintain factual consistency and domain-appropriate vocabulary.
- Anthology ID:
- 2026.alvr-main.15
- Volume:
- Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, USA
- Editors:
- Qianqi Yan, Syrielle Montariol, Yue Fan, Jing Gu, Jiayi Pan, Manling Li, Parisa Kordjamshidi, Alane Suhr, Xin Eric Wang
- Venues:
- ALVR | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 186–198
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.15/
- Cite (ACL):
- Md. Mahfuzur Rahman, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Kishor Datta Gupta, and Roy George. 2026. VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment. In Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR), pages 186–198, San Diego, California, USA. Association for Computational Linguistics.
- Cite (Informal):
- VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment (Rahman et al., ALVR 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.15.pdf