InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model

Haogeng Liu, Quanzeng You, Yiqi Wang, Xiaotian Han, Bohan Zhai, Yongfei Liu, Wentao Chen, Yiren Jian, Yunzhe Tao, Jianbo Yuan, Ran He, Hongxia Yang


Abstract
In this work, we present InfiMM, an advanced Multimodal Large Language Model adept at intricate vision-language tasks. InfiMM, inspired by the Flamingo architecture, distinguishes itself through its use of large-scale training data, comprehensive training strategies, and diverse large language models. This approach preserves Flamingo’s foundational strengths while introducing augmented capabilities. Empirical evaluations across a variety of benchmarks underscore InfiMM’s remarkable capability in multimodal understanding. The code can be found at: https://anonymous.4open.science/r/infimm-zephyr-F60C/.
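For readers unfamiliar with the Flamingo design the abstract refers to, the sketch below shows a gated cross-attention block of the kind such models insert between frozen language-model layers. This is a minimal illustration under that assumption, not the authors' released implementation; the class name, dimensions, and hyperparameters are hypothetical.

```python
# Hypothetical sketch of a Flamingo-style gated cross-attention block.
# Names and hyperparameters are illustrative, not from the InfiMM code.
import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    """Injects visual features into LLM hidden states via cross-attention,
    gated by tanh scalars initialised at zero so the frozen language model
    is unchanged at the start of training."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(),
            nn.Linear(4 * dim, dim),
        )
        self.norm = nn.LayerNorm(dim)
        # Zero-initialised gates: the block starts as an identity map.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor,
                visual_tokens: torch.Tensor) -> torch.Tensor:
        q = self.norm(text_hidden)
        attended, _ = self.attn(q, visual_tokens, visual_tokens)
        x = text_hidden + torch.tanh(self.attn_gate) * attended
        x = x + torch.tanh(self.ffn_gate) * self.ffn(x)
        return x


# Usage: blend 64 visual tokens into a sequence of 32 text hidden states.
block = GatedCrossAttentionBlock(dim=512)
text = torch.randn(2, 32, 512)     # (batch, text_len, dim)
vision = torch.randn(2, 64, 512)   # (batch, num_visual_tokens, dim)
print(block(text, vision).shape)   # torch.Size([2, 32, 512])
```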
Anthology ID: 2024.findings-acl.27
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 485–492
URL: https://aclanthology.org/2024.findings-acl.27
DOI: 10.18653/v1/2024.findings-acl.27
Cite (ACL): Haogeng Liu, Quanzeng You, Yiqi Wang, Xiaotian Han, Bohan Zhai, Yongfei Liu, Wentao Chen, Yiren Jian, Yunzhe Tao, Jianbo Yuan, Ran He, and Hongxia Yang. 2024. InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model. In Findings of the Association for Computational Linguistics: ACL 2024, pages 485–492, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model (Liu et al., Findings 2024)
PDF: https://preview.aclanthology.org/add_acl24_videos/2024.findings-acl.27.pdf
Video: https://preview.aclanthology.org/add_acl24_videos/2024.findings-acl.27.mp4