Ziang Zhang

2026

Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data but often struggle with complex reasoning. Reinforcement learning (RL) can enhance reasoning, yet it may cause performance degradation on general tasks and overthinking in MLLMs. We propose Asymmetric Policy Optimization (APO), which separates responses into positive and negative groups. For positive samples, Difficulty-Adaptive Divergence Shaping (DADS) dynamically adjusts the KL weight to stabilize training and preserve knowledge. For negative samples, Suboptimal Trajectory Complexity Regularization (STCR) penalizes overly long responses to reduce overthinking. Applied to Qwen2.5-VL, our model View-R1 achieves a 10.55% improvement in reasoning and outperforms larger models (7–11B) while maintaining, and even slightly improving, performance on general tasks. These results highlight the effectiveness and broad applicability of our DADS and STCR techniques for advancing complex multimodal reasoning in MLLMs. Our code is available at https://github.com/Collab-Gen/View-R1.

2025

3D scene understanding has gained significant attention due to its wide range of applications. However, existing methods for 3D scene understanding are limited to specific downstream tasks, which hinders their practicality in real-world applications. This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs to achieve the first universal dialogue system for 3D scenes. Specifically, we align 3D representations into the feature space of LLMs, thus enabling LLMs to perceive the 3D world. Given the scarcity of 3D scene-text data, we propose a three-stage training strategy to efficiently utilize the available data for better alignment. To enhance the reasoning ability and develop a user-friendly interaction scheme, we further construct a high-quality object-centric 3D instruction dataset and design an associated object-centric prompt. With limited data, Chat-3D achieves an 82.2% relative score compared with GPT-4 on the constructed instruction dataset, and comparable performance to state-of-the-art LLM-based methods.

Co-authors

Haifeng Huang 1

Jiabao Zhang 1

Yang Zhao 1

Venues

Findings 2
