Minjie Hong
2026
View-R1: Asymmetric Policy Optimization for Difficulty-Aware Multimodal Reinforcement Learning
Minjie Hong | Zirun Guo | Jiabao Zhang | Zehan Wang | Ziang Zhang | Tao Jin | Zhou Zhao
Findings of the Association for Computational Linguistics: ACL 2026
Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data but often struggle with complex reasoning. Reinforcement learning (RL) can enhance reasoning, yet it may cause performance degradation on general tasks and overthinking in MLLMs. We propose Asymmetric Policy Optimization (APO), which separates responses into positive and negative groups. For positive samples, Difficulty-Adaptive Divergence Shaping (DADS) dynamically adjusts the KL weight to stabilize training and preserve knowledge. For negative samples, Suboptimal Trajectory Complexity Regularization (STCR) penalizes overly long responses to reduce overthinking. Applied to Qwen2.5-VL, our model View-R1 achieves a 10.55% improvement in reasoning and outperforms larger models (7–11B) while not only maintaining but also slightly improving performance on general tasks. These results highlight the effectiveness and broad applicability of our DADS and STCR techniques for advancing complex multimodal reasoning in MLLMs. Our code is available at https://github.com/Collab-Gen/View-R1.
DUET: Joint Exploration of User–Item Profiles in Recommendation System
Yue Chen | Yifei Sun | Lu Wang | Fangkai Yang | Pu Zhao | Minjie Hong | Yifei Dong | Minghua He | Nan Hu | Jianjin Zhang | Zhiwei Dai | Yuefeng Zhan | Weihao Han | Hao Sun | Qingwei Lin | Weiwei Deng | Feng Sun | Qi Zhang | Saravan Rajmohan | Dongmei Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Traditional recommendation systems represent users and items as dense vectors and learn to align them in a shared latent space for relevance estimation. Recent LLM-based recommenders instead leverage natural-language representations that are easier to interpret and integrate with downstream reasoning modules. This paper studies how to construct effective textual profiles for users and items, and how to align them for recommendation. A central difficulty is that the best profile format is not known a priori: manually designed templates can be brittle and misaligned with task objectives. Moreover, generating user and item profiles independently may produce descriptions that are individually plausible yet semantically inconsistent for a specific user–item pair. We propose Duet, an interaction-aware profile generator that jointly produces user and item profiles conditioned on both user history and item evidence. Duet follows a three-stage procedure: it first turns raw histories and metadata into compact cues, then expands these cues into paired profile prompts and generates profiles from them, and finally optimizes the generation policy with reinforcement learning using downstream recommendation performance as feedback. Experiments on three real-world datasets show that Duet consistently outperforms strong baselines, demonstrating the benefits of template-free profile exploration and joint user–item textual alignment. Project page: https://duet-rec.github.io/.
2024
AudioVSR: Enhancing Video Speech Recognition with Audio Data
Xiaoda Yang | Xize Cheng | Jiaqi Duan | Hongshun Qiu | Minjie Hong | Minghui Fang | Shengpeng Ji | Jialong Zuo | Zhiqing Hong | Zhimeng Zhang | Tao Jin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Visual Speech Recognition (VSR) aims to predict spoken content by analyzing lip movements in videos. Recently reported state-of-the-art results in VSR often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are insufficient compared to the audio data. To further enhance the VSR model using the audio data, we employed a generative model for data inflation, integrating the synthetic data with the authentic visual data. Essentially, the generative model incorporates another insight, which enhances the capabilities of the recognition model. For the cross-language issue, previous work has shown poor performance with non-Indo-European languages. We trained a multi-language-family modal fusion model, AudioVSR. Leveraging the concept of modal transfer, we achieved significant results in downstream VSR tasks under conditions of data scarcity. To the best of our knowledge, AudioVSR represents the first work on cross-language-family audio-lip alignment, achieving a new SOTA in the cross-language scenario.
Co-authors
- Tao Jin 2
- Yue Chen 1
- Xize Cheng 1
- Zhiwei Dai 1
- Weiwei Deng 1
- Yifei Dong 1
- Jiaqi Duan 1
- Minghui Fang 1
- Zirun Guo 1
- Weihao Han 1
- Minghua He 1
- Zhiqing Hong 1
- Nan Hu 1
- Shengpeng Ji 1
- Qingwei Lin 1
- Hongshun Qiu 1
- Saravan Rajmohan 1
- Yifei Sun 1
- Hao Sun 1
- Feng Sun 1
- Zehan Wang 1
- Lu Wang 1
- Xiaoda Yang 1
- Fangkai Yang 1
- Yuefeng Zhan 1
- Zhimeng Zhang 1
- Jiabao Zhang 1
- Ziang Zhang 1
- Jianjin Zhang 1
- Qi Zhang 1
- Dongmei Zhang 1
- Zhou Zhao 1
- Pu Zhao 1
- Jialong Zuo 1