Infrared-LLaVA: Enhancing Understanding of Infrared Images in Multi-Modal Large Language Models
Shixin Jiang, Zerui Chen, Jiafeng Liang, Yanyan Zhao, Ming Liu, Bing Qin
Abstract
Expanding the understanding capabilities of multi-modal large language models (MLLMs) to the infrared modality is challenging due to its single-modality nature and the limited amount of training data. Existing methods typically construct a uniform embedding space for cross-modal alignment and leverage abundant visible-image data to understand infrared images indirectly. However, they ignore the supervisory signals of infrared-modality-specific attributes, which may lead to a biased understanding of infrared images. To address this issue, we propose a debating multi-agent generation system that transfers knowledge from visible images to generate infrared image-text pairs and infrared instruction data. Moreover, we construct an infrared question-answering benchmark based on common infrared tasks. Experimental results from incremental fine-tuning of existing models, and from our Infrared-LLaVA-7B trained from scratch on infrared data, demonstrate the effectiveness of the generated data and the feasibility of the generation approach.
- Anthology ID:
- 2024.findings-emnlp.501
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2024
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 8573–8591
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2024.findings-emnlp.501/
- DOI:
- 10.18653/v1/2024.findings-emnlp.501
- Cite (ACL):
- Shixin Jiang, Zerui Chen, Jiafeng Liang, Yanyan Zhao, Ming Liu, and Bing Qin. 2024. Infrared-LLaVA: Enhancing Understanding of Infrared Images in Multi-Modal Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8573–8591, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- Infrared-LLaVA: Enhancing Understanding of Infrared Images in Multi-Modal Large Language Models (Jiang et al., Findings 2024)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2024.findings-emnlp.501.pdf