2025
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
Xiaotian Han | Yiren Jian | Xuefeng Hu | Haogeng Liu | Yiqi Wang | Qihang Fan | Yuang Ai | Huaibo Huang | Ran He | Zhenheng Yang | Quanzeng You
Findings of the Association for Computational Linguistics: EMNLP 2025
Pre-training on large, high-quality datasets is essential for improving the reasoning abilities of Large Language Models (LLMs), particularly in specialized fields like mathematics. However, the field of Multimodal LLMs (MLLMs) lacks a comprehensive, open-source dataset for mathematical reasoning. To fill this gap, we present InfiMM-WebMath-40B, a high-quality dataset of interleaved image-text documents. It consists of 24 million web pages, 85 million image URLs, and 40 billion text tokens, all carefully extracted and filtered from CommonCrawl. We outline our data collection and processing pipeline in detail. Models trained on InfiMM-WebMath-40B demonstrate strong performance in both text-only and multimodal settings, setting a new state-of-the-art on multimodal math benchmarks such as MathVerse and We-Math.
2024
DeVAn: Dense Video Annotation for Video-Language Models
Tingkai Liu | Yunzhe Tao | Haogeng Liu | Qihang Fang | Ding Zhou | Huaibo Huang | Ran He | Hongxia Yang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present DeVAn (Dense Video Annotation), a novel human-annotated dataset for evaluating the ability of visual-language models to generate both short and long descriptions of real-world video clips. The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests. Each video clip is independently annotated by 5 human annotators, producing both captions (1 sentence) and summaries (3-10 sentences). Given any video selected from the dataset and its corresponding ASR information, we evaluate visual-language models on either caption or summary generation that is grounded in both the visual and auditory content of the video. Models are also evaluated on caption- and summary-based retrieval tasks, where the summary-based retrieval task requires identifying a target video given excerpts of a summary. Given the novel nature of the paragraph-length video summarization task, we compared different existing evaluation metrics and their alignment with human preferences and found that model-based evaluation metrics provide more semantically oriented and human-aligned evaluation. Finally, we benchmarked a wide range of current video-language models on DeVAn, and we aim for DeVAn to serve as a useful evaluation set in the age of large language models and complex multi-modal tasks. Code is available at https://github.com/TK-21st/DeVAn.
InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model
Haogeng Liu | Quanzeng You | Yiqi Wang | Xiaotian Han | Bohan Zhai | Yongfei Liu | Wentao Chen | Yiren Jian | Yunzhe Tao | Jianbo Yuan | Ran He | Hongxia Yang
Findings of the Association for Computational Linguistics: ACL 2024
In this work, we present InfiMM, an advanced Multimodal Large Language Model that adapts to intricate vision-language tasks. InfiMM, inspired by the Flamingo architecture, distinguishes itself through the utilization of large-scale training data, comprehensive training strategies, and diverse large language models. This approach ensures the preservation of Flamingo’s foundational strengths while simultaneously introducing augmented capabilities. Empirical evaluations across a variety of benchmarks underscore InfiMM’s remarkable capability in multimodal understanding. The code can be found at: https://anonymous.4open.science/r/infimm-zephyr-F60C/.