Taowen Wang
2026
On-the-Fly VLA Adaptation via Test-Time Reinforcement Learning
Changyu Liu | Yiyang Liu | Taowen Wang | Qiao Zhuang | James Chenhao Liang | Wenhao Yang | Renjing Xu | Qifan Wang | Dongfang Liu | Cheng Han
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for general-purpose robot learning, enabling agents to map visual observations and natural-language instructions into executable robotic actions. Though popular, they are primarily trained via supervised fine-tuning (SFT) or training-time reinforcement learning (RL), requiring explicit fine-tuning phases, human interventions, or controlled data collection. Consequently, existing methods remain unsuitable for challenging simulated- or physical-world deployments, where robots must respond autonomously and flexibly to evolving environments. To address this limitation, we introduce Test-Time Reinforcement Learning for VLAs (TT-VLA), a framework that enables on-the-fly policy adaptation during inference. TT-VLA formulates a dense reward mechanism that leverages step-by-step task-progress signals to refine action policies at test time while preserving the SFT/RL-trained priors, making it an effective supplement to current VLA models. Empirical results show that our approach enhances overall adaptability, stability, and task success in dynamic, previously unseen scenarios under simulated and real-world settings. We believe TT-VLA offers a principled step toward self-improving, deployment-ready VLAs.
2024
M2PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning
Taowen Wang | Yiyang Liu | James Chenhao Liang | Junhan Zhao | Yiming Cui | Yuning Mao | Shaoliang Nie | Jiahao Liu | Fuli Feng | Zenglin Xu | Cheng Han | Lifu Huang | Qifan Wang | Dongfang Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains, with increasing emphasis on enhancing their zero-shot generalization capabilities for unseen tasks across various modalities. Instruction tuning has emerged as an effective strategy for achieving zero-shot generalization by finetuning pretrained models on diverse multimodal tasks. As the scale of MLLMs continues to grow, parameter-efficient finetuning becomes increasingly critical. However, most existing parameter-efficient approaches focus only on single modalities and often overlook the multimodal characteristics during finetuning. In this work, we introduce a novel Multimodal Prompt Tuning (M2PT) approach for efficient instruction tuning of MLLMs. M2PT effectively integrates visual and textual prompts into the vision encoder and language processor respectively during finetuning, facilitating the extraction and alignment of features across modalities. Empirical results on various multimodal evaluation datasets demonstrate the superior performance of our approach compared to several state-of-the-art baselines. A comprehensive set of ablation studies validates the effectiveness of our prompt design and the efficiency of our approach.