Yujun Wang
Also published as: 钰君 王
2025
LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering
Jinhe Bi | Yujun Wang | Haokun Chen | Xun Xiao | Artur Hecker | Volker Tresp | Yunpu Ma
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal Large Language Models (MLLMs) enhance visual tasks by integrating visual representations into large language models (LLMs). The textual modality, inherited from LLMs, enables instruction following and in-context learning, while the visual modality boosts downstream task performance through rich semantic content, spatial information, and grounding capabilities. These modalities work synergistically across various visual tasks. Our research reveals a persistent imbalance between these modalities, with text often dominating output generation during visual instruction tuning, regardless of whether full fine-tuning or parameter-efficient fine-tuning (PEFT) is used. We find that re-balancing these modalities can significantly reduce the number of trainable parameters, inspiring further optimization of visual instruction tuning. To this end, we introduce Modality Linear Representation-Steering (MoReS), which re-balances the intrinsic modalities by steering visual representations through linear transformations in the visual subspace at each model layer. We validate our approach by developing LLaVA Steering, a suite of models using MoReS. Results show that LLaVA Steering requires, on average, 500 times fewer trainable parameters than LoRA while maintaining comparable performance across three visual benchmarks and eight visual question-answering tasks. Finally, we introduce the LLaVA Steering Factory, a platform that enables rapid customization of MLLMs with a component-based architecture, seamlessly integrating state-of-the-art models and evaluating intrinsic modality imbalance. This open-source project facilitates a deeper understanding of MLLMs within the research community.
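The abstract describes steering visual representations with a linear transformation at each layer while leaving text tokens alone, but does not spell out the exact formulation. As a loose illustration only, a minimal PyTorch sketch of per-layer linear steering restricted to visual tokens might look like the following; the class name, the low-rank residual form, and the masking scheme are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class VisualSteeringLayer(nn.Module):
    """Hypothetical sketch: a small learned linear map applied only to
    visual-token hidden states, leaving text tokens untouched."""

    def __init__(self, hidden_dim: int, rank: int = 4):
        super().__init__()
        # Low-rank linear steering in an assumed visual subspace.
        self.down = nn.Linear(hidden_dim, rank, bias=False)
        self.up = nn.Linear(rank, hidden_dim, bias=False)
        # Zero-init the up-projection so steering starts as the identity.
        nn.init.zeros_(self.up.weight)

    def forward(self, hidden: torch.Tensor, visual_mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim); visual_mask: (batch, seq) bool,
        # True at positions holding visual tokens.
        steered = hidden + self.up(self.down(hidden))
        return torch.where(visual_mask.unsqueeze(-1), steered, hidden)
```

Because only the two small projection matrices are trainable, a construction like this would train far fewer parameters than LoRA applied across attention weights, which is consistent with the parameter-count claim in the abstract.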
2024
基于大型语言模型的中文空间语义评测 (Chinese Spatial Semantic Evaluation Based on Large Language Models)
Shitu Huo (霍世图) | Yujun Wang (王钰君) | Tongjie Wu (吴童杰)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)
This shared task evaluates the spatial semantic understanding of large language models across entity recognition, role recognition, anomaly recognition, information inference, and synonym recognition. We probe the models with three prompting strategies (plain prompts, workflow prompts, and chain-of-thought prompts) and find that ERNIE-4 performs best with 1-shot plain prompting. Our method ranked sixth overall, with an overall accuracy of 56.20%.
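The abstract names three prompting strategies without giving the prompt texts. Purely to illustrate how such variants typically differ, a hedged sketch follows; the helper name and all prompt wording below are invented for illustration and are not the paper's actual prompts:

```python
def build_prompt(question: str, strategy: str = "plain") -> str:
    """Illustrative prompt variants; not the paper's actual prompt texts."""
    if strategy == "plain":
        # Plain prompt: ask the question directly.
        return f"Question: {question}\nAnswer:"
    if strategy == "workflow":
        # Workflow prompt: decompose the task into explicit steps.
        return (
            "Follow these steps: (1) identify the spatial entities, "
            "(2) determine their roles and relations, (3) answer.\n"
            f"Question: {question}\nAnswer:"
        )
    if strategy == "cot":
        # Chain-of-thought prompt: elicit intermediate reasoning.
        return f"Question: {question}\nLet's think step by step."
    raise ValueError(f"unknown strategy: {strategy}")
```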
Co-authors
- Jinhe Bi 1
- Haokun Chen 1
- Artur Hecker 1
- Shitu Huo (霍世图) 1
- Yunpu Ma 1