InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model
Haogeng Liu, Quanzeng You, Yiqi Wang, Xiaotian Han, Bohan Zhai, Yongfei Liu, Wentao Chen, Yiren Jian, Yunzhe Tao, Jianbo Yuan, Ran He, Hongxia Yang
Abstract
In this work, we present InfiMM, an advanced Multimodal Large Language Model tailored to intricate vision-language tasks. InfiMM, inspired by the Flamingo architecture, distinguishes itself through its use of large-scale training data, comprehensive training strategies, and diverse large language models. This approach preserves Flamingo's foundational strengths while introducing new capabilities. Empirical evaluations across a variety of benchmarks underscore InfiMM's remarkable capability in multimodal understanding. The code can be found at: https://anonymous.4open.science/r/infimm-zephyr-F60C/.
- Anthology ID: 2024.findings-acl.27
- Volume: Findings of the Association for Computational Linguistics: ACL 2024
- Month: August
- Year: 2024
- Address: Bangkok, Thailand
- Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 485–492
- URL: https://aclanthology.org/2024.findings-acl.27
- DOI: 10.18653/v1/2024.findings-acl.27
- Cite (ACL): Haogeng Liu, Quanzeng You, Yiqi Wang, Xiaotian Han, Bohan Zhai, Yongfei Liu, Wentao Chen, Yiren Jian, Yunzhe Tao, Jianbo Yuan, Ran He, and Hongxia Yang. 2024. InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model. In Findings of the Association for Computational Linguistics: ACL 2024, pages 485–492, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal): InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model (Liu et al., Findings 2024)
- PDF: https://preview.aclanthology.org/add_acl24_videos/2024.findings-acl.27.pdf