Infrared-LLaVA: Enhancing Understanding of Infrared Images in Multi-Modal Large Language Models
Shixin Jiang, Zerui Chen, Jiafeng Liang, Yanyan Zhao, Ming Liu, Bing Qin
Abstract
Expanding the understanding capabilities of multi-modal large language models (MLLMs) to the infrared modality is challenging due to its single-modality nature and the limited amount of training data. Existing methods typically construct a uniform embedding space for cross-modal alignment and leverage abundant visible-image data to understand infrared images indirectly. However, they ignore the supervisory signals of infrared-modality-specific attributes, which may lead to a biased understanding of infrared images. To address this issue, we propose a debating multi-agent generation system that transfers knowledge from visible images to generate infrared image-text pairs and infrared instruction data. Moreover, we construct an infrared question-answering benchmark based on common infrared tasks. Experimental results from incremental fine-tuning of existing models, and from our Infrared-LLaVA-7B trained from scratch on infrared data, demonstrate the effectiveness of the generated data and the feasibility of the generation approach.
- Anthology ID:
- 2024.findings-emnlp.501
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2024
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 8573–8591
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2024.findings-emnlp.501/
- DOI:
- 10.18653/v1/2024.findings-emnlp.501
- Cite (ACL):
- Shixin Jiang, Zerui Chen, Jiafeng Liang, Yanyan Zhao, Ming Liu, and Bing Qin. 2024. Infrared-LLaVA: Enhancing Understanding of Infrared Images in Multi-Modal Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8573–8591, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- Infrared-LLaVA: Enhancing Understanding of Infrared Images in Multi-Modal Large Language Models (Jiang et al., Findings 2024)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2024.findings-emnlp.501.pdf