Weiyun Wang
2025
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Xiangyu Zhao | Shengyuan Ding | Zicheng Zhang | Haian Huang | Maosong Cao | Jiaqi Wang | Weiyun Wang | Xinyu Fang | Wenhai Wang | Guangtao Zhai | Hua Yang | Haodong Duan | Kai Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs’ alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs’ alignment with human values. Experimental results show that fine-tuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly improves human preference alignment while maintaining or improving performance on standard VQA benchmarks, thereby preserving their fundamental capabilities.
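For readers unfamiliar with the preference-optimization step mentioned in the abstract, the following is a minimal, generic sketch of the DPO objective in PyTorch. It is not the authors' training code; the function name, batch values, and beta setting are illustrative assumptions, and only the loss formula itself follows the standard DPO formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Generic DPO loss over a batch of (chosen, rejected) response pairs.

    Each argument is a tensor of summed log-probabilities of a response
    under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: log-ratio of policy vs. reference for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen response to outscore the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for two preference pairs
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.3, -9.8]),
    policy_rejected_logps=torch.tensor([-15.1, -11.2]),
    ref_chosen_logps=torch.tensor([-13.0, -10.0]),
    ref_rejected_logps=torch.tensor([-14.5, -10.9]),
)
```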
2023
CLIPText: A New Paradigm for Zero-shot Text Classification
Libo Qin | Weiyun Wang | Qiguang Chen | Wanxiang Che
Findings of the Association for Computational Linguistics: ACL 2023
While CLIP models are useful for zero-shot vision-and-language (VL) tasks and computer vision tasks, little attention has been paid to applying CLIP to language tasks. Intuitively, CLIP models have rich representations pre-trained with natural language supervision, which we argue are also useful for language tasks. Hence, this work bridges this gap by investigating a CLIP model for zero-shot text classification. Specifically, we introduce CLIPText, a novel paradigm for zero-shot text classification, which reformulates zero-shot text classification into a text-image matching problem that CLIP can be applied to. In addition, we further incorporate prompts into CLIPText (Prompt-CLIPText) to better derive knowledge from CLIP. Experimental results on seven publicly available zero-shot text classification datasets show that both CLIPText and Prompt-CLIPText attain promising performance. Extensive analysis further verifies that knowledge from CLIP can benefit the zero-shot text classification task. We hope this work attracts more breakthroughs in applying VL pre-trained models to language tasks.
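To make the text-image matching formulation concrete, here is a minimal sketch using the Hugging Face transformers CLIP interface. It assumes, purely for illustration, that each candidate class is represented by a single representative image (the class_names and image filenames are hypothetical); the input text is scored against those images by CLIP similarity. How CLIPText and Prompt-CLIPText actually construct the image side and the prompts follows the paper, not this sketch.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical setup: one representative image per candidate class.
class_names = ["sports", "politics", "technology"]
class_images = [Image.open(f"{name}.jpg") for name in class_names]

text = "The striker scored twice in the second half."

inputs = processor(text=[text], images=class_images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_texts, num_images): text-image similarity.
predicted = class_names[outputs.logits_per_text.argmax(dim=-1).item()]
print(predicted)
```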