View-R1: Asymmetric Policy Optimization for Difficulty-Aware Multimodal Reinforcement Learning

Minjie Hong, Zirun Guo, Jiabao Zhang, Zehan Wang, Ziang Zhang, Tao Jin, Zhou Zhao


Abstract
Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data but often struggle with complex reasoning. Reinforcement learning (RL) can enhance reasoning, yet it may cause performance degradation on general tasks and overthinking in MLLMs. We propose Asymmetric Policy Optimization (APO), which separates responses into positive and negative groups. For positive samples, Difficulty-Adaptive Divergence Shaping (DADS) dynamically adjusts the KL weight to stabilize training and preserve knowledge. For negative samples, Suboptimal Trajectory Complexity Regularization (STCR) penalizes overly long responses to reduce overthinking. Applied to Qwen2.5-VL, our model View-R1 achieves a 10.55% improvement in reasoning and outperforms larger models (7–11B) while not only maintaining but also slightly improving performance on general tasks. These results highlight the effectiveness and broad applicability of our DADS and STCR techniques for advancing complex multimodal reasoning in MLLMs. Our code is available at https://github.com/Collab-Gen/View-R1.
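The abstract describes the two halves of APO only at a high level, so the following is a minimal illustrative sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the reward threshold used to split the groups, the use of the per-prompt pass rate as a difficulty signal, the direction in which the KL weight is scaled, the linear length penalty, and all names (`Sample`, `split_by_reward`, `adaptive_kl_weight`, `length_penalty`). The released code at the URL above is the authority on the actual formulation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    reward: float  # scalar reward for one sampled response
    length: int    # response length in tokens


def split_by_reward(samples: List[Sample], threshold: float = 0.5) -> Tuple[List[Sample], List[Sample]]:
    """Split sampled responses into positive and negative groups (threshold is assumed)."""
    positives = [s for s in samples if s.reward > threshold]
    negatives = [s for s in samples if s.reward <= threshold]
    return positives, negatives


def adaptive_kl_weight(samples: List[Sample], base_beta: float = 0.04, threshold: float = 0.5) -> float:
    """Hypothetical difficulty-adaptive KL weight.

    Assumption: difficulty is estimated from the pass rate of the sampled group,
    and easier prompts (higher pass rate) receive a larger KL weight.
    """
    if not samples:
        return base_beta
    positives, _ = split_by_reward(samples, threshold)
    pass_rate = len(positives) / len(samples)
    return base_beta * pass_rate


def length_penalty(sample: Sample, max_len: int, coeff: float = 0.1) -> float:
    """Hypothetical penalty for a negative sample, growing linearly with length and capped at coeff."""
    return coeff * min(sample.length / max_len, 1.0)


if __name__ == "__main__":
    group = [Sample(1.0, 100), Sample(0.0, 400), Sample(1.0, 200), Sample(0.0, 800)]
    positives, negatives = split_by_reward(group)
    print("KL weight for this prompt:", adaptive_kl_weight(group))
    for s in negatives:
        print("length penalty:", length_penalty(s, max_len=800))
```

In this sketch the KL weight applies only to the positive group and the length penalty only to the negative group, mirroring the asymmetric treatment the abstract describes.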
Anthology ID:
2026.findings-acl.1538
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
30791–30803
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1538/
Cite (ACL):
Minjie Hong, Zirun Guo, Jiabao Zhang, Zehan Wang, Ziang Zhang, Tao Jin, and Zhou Zhao. 2026. View-R1: Asymmetric Policy Optimization for Difficulty-Aware Multimodal Reinforcement Learning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 30791–30803, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
View-R1: Asymmetric Policy Optimization for Difficulty-Aware Multimodal Reinforcement Learning (Hong et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1538.pdf
Checklist:
 2026.findings-acl.1538.checklist.pdf