Chi-Min Chan
2026
Omni-RewardBench: Toward a Comprehensive Evaluation of Generative Reward Models Across Modalities
Chi-Min Chan | Yujin Zhou | Pengcheng Wen | Boqin Yin | Jiaming Ji | Juntao Dai | Wei Xue | Sirui Han | Yike Guo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rise of Omni-modality Large Language Models (OLLMs) capable of jointly processing text, audio, and visual inputs marks a major step toward general intelligence. Ensuring their alignment with human preferences requires effective Omni-modality Reward Models (ORMs), which serve as surrogates for human judgment to guide OLLM behavior. However, ORM evaluation remains underdeveloped in the literature. Existing benchmarks are largely text-centric or limited to bimodal tasks, which restricts comprehensive assessment of ORMs. To bridge this gap, we introduce Omni-RewardBench, the first benchmark for comprehensive evaluation of ORMs across modalities. Our contributions are threefold: (1) a hybrid automatic-annotation and human-verification pipeline to construct high-quality evaluation data; (2) extensive experiments on 20+ models, including inherently omni-modal and modality-bridged systems, which show that current OLLMs fall short as reward models and reveal several common failure modes such as perception failure, modality dominance failure, and cross-modal fusion failure; and (3) strong correlations between Omni-RewardBench scores and downstream performance (IID r = 0.94, OOD r = 0.72), validating its reliability as a predictor of real-world capability and alignment quality.
Benchmarking Fine-Grained Error Detection in Multimodal Reasoning
Chi-Min Chan | Han Zhu | Chunyang Jiang | Jiaming Ji | Juntao Dai | Wei Xue | Sirui Han | Yike Guo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal Process Reward Models (MPRMs) have emerged as a pivotal framework for enhancing the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, the research community currently lacks a dedicated benchmark to rigorously assess the error discernment capabilities of these models. To address this gap, we introduce PRMBench-V, a novel benchmark specifically designed to evaluate MPRMs’ proficiency in detecting erroneous reasoning steps across diverse error categories. Leveraging a semi-automated annotation pipeline augmented with human verification, we construct a comprehensive dataset comprising 907 unique queries, each annotated with nine distinct error types, resulting in 8,163 test cases with fine-grained step-level error labels. Through extensive experiments involving over 15 open- and closed-source models, we uncover several key findings: (1) even the strongest existing MPRMs achieve only ~30% accuracy in error identification; (2) while partial error detection achieves moderate precision and recall (~60%), overall accuracy remains low (~20%); and (3) benchmark scores exhibit a strong correlation with downstream task performance gains (r=0.86). Furthermore, we demonstrate that PRMBench-V can inform the development of more robust MPRMs: by introducing the Bayesian Rater Reliability Process Reward Model (BR2-PRM), we achieve up to a 4.8% performance improvement through test-time scaling. We believe that PRMBench-V will serve as a valuable resource for advancing MPRM research, enabling more rigorous evaluation and fostering the development of models with fine-grained multimodal reasoning capabilities.
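The test-case total quoted in the PRMBench-V abstract is consistent with its two stated counts. A minimal arithmetic check, assuming one test case per (query, error type) pair; the variable names are illustrative and not taken from the paper:

```python
# Counts taken from the abstract above; variable names are illustrative.
num_queries = 907        # unique queries
num_error_types = 9      # error types annotated per query

# Assumption: one test case per (query, error type) pair.
num_test_cases = num_queries * num_error_types
print(num_test_cases)  # 8163
```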
SafeMT: Multi-turn Safety for Multimodal Language Models
Han Zhu | Juntao Dai | Jiaming Ji | Haoran Li | Chengkun Cai | Pengcheng Wen | Chi-Min Chan | Boyuan Chen | Yaodong Yang | Sirui Han | Yike Guo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the widespread use of Multimodal Large Language Models (MLLMs), safety issues have become a growing concern. Multi-turn dialogues, which are more common in everyday interactions, pose a greater risk than single prompts; however, existing benchmarks do not adequately consider this situation. To encourage the community to focus on the safety issues of these models in multi-turn dialogues, we introduce SafeMT, a benchmark that features dialogues of varying lengths generated from harmful queries accompanied by images. This benchmark consists of 10,000 samples in total, encompassing 17 different scenarios and four jailbreak methods. Additionally, we propose the Safety Index (SI) to evaluate the general safety of MLLMs during conversations. We assess the safety of 17 models using this benchmark and discover that the risk of successful attacks on these models increases as the number of turns in harmful dialogues rises. This observation indicates that the safety mechanisms of these models are inadequate for recognizing hazards in dialogue interactions. We propose a dialogue safety moderator capable of detecting malicious intent concealed within conversations and providing MLLMs with relevant safety policies. Experimental results from several open-source models indicate that this moderator is more effective at reducing multi-turn Attack Success Rate (ASR) than existing guard models.
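The SafeMT abstract reports that attack success grows with the number of dialogue turns. A rough sketch of how an Attack Success Rate could be tabulated per turn count, assuming each evaluated dialogue is reduced to a (number of turns, attack succeeded) pair. The function name and the toy records are hypothetical, the numbers are not results from the paper, and the Safety Index is not reproduced here because the abstract does not define it:

```python
from collections import defaultdict


def attack_success_rate_by_turns(records):
    """Return {num_turns: ASR} for an iterable of (num_turns, succeeded) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for num_turns, succeeded in records:
        totals[num_turns] += 1
        successes[num_turns] += int(succeeded)
    # ASR = successful attacks / attempted attacks, per dialogue length.
    return {t: successes[t] / totals[t] for t in sorted(totals)}


# Made-up toy outcomes for illustration only, NOT data from the paper.
toy_records = [
    (2, False), (2, False), (2, True), (2, False),
    (4, True), (4, False), (4, True), (4, False),
]
asr = attack_success_rate_by_turns(toy_records)
print(asr)
```

Grouping by dialogue length keeps the per-turn trend visible instead of averaging it away in a single overall rate.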
2025
PIP: Perturbation-based Iterative Pruning for Large Language Models
Yi Cao | Wei-Jie Xu | Yucheng Shen | Weijie Shi | Chi-Min Chan | Jianfeng Qu | Jiajie Xu
Findings of the Association for Computational Linguistics: EMNLP 2025
The rapid increase in the parameter counts of Large Language Models (LLMs), which often reach into the billions or even trillions, presents significant challenges for their practical deployment, particularly in resource-constrained environments. To address this issue, we propose PIP (Perturbation-based Iterative Pruning), a novel double-view structured pruning method for LLMs that combines information from two views: the unperturbed view and the perturbed view. By calculating gradient differences, PIP iteratively prunes the structures that struggle to distinguish between these two views. Our experiments show that PIP reduces the parameter count by approximately 20% while retaining over 85% of the original model’s accuracy across varied benchmarks. In some cases, the performance of the pruned model is within 5% of the unpruned version, demonstrating PIP’s ability to preserve key aspects of model effectiveness. Moreover, PIP consistently outperforms existing state-of-the-art (SOTA) structured pruning methods, establishing it as a leading technique for optimizing LLMs in constrained environments.
Boosting Policy and Process Reward Models with Monte Carlo Tree Search in Open-Domain QA
Chi-Min Chan | Chunpu Xu | Junqi Zhu | Jiaming Ji | Donghai Hong | Pengcheng Wen | Chunyang Jiang | Zhen Ye | Yaodong Yang | Wei Xue | Sirui Han | Yike Guo
Findings of the Association for Computational Linguistics: ACL 2025
The recent introduction of OpenAI’s O1/O3 models represents a significant milestone in developing strong reasoning capabilities in Large Language Models (LLMs). Given a larger computational budget at test time, LLMs have the potential to explore more accurate and higher-quality solutions. However, such paradigms are primarily verified in domains that have well-defined criteria for responses, such as coding and mathematics. Inspired by the success of this paradigm, we aim to extend it to more subtle open-domain question answering. Specifically, we utilize search mechanisms such as Monte Carlo Tree Search (MCTS) to improve both the policy model and the reward model, achieving better performance under test-time scaling strategies. Our contributions are twofold. For the training phase, we demonstrate that our approach surpasses previous SOTA automatic data annotation methods and various public instruction-tuning datasets while using fewer data points. This offers a more data-efficient solution for training robust models. For the inference phase, we utilize the intermediate values collected during training data construction to train a process reward model called PRM+. This model employs a novel two-stage training method to provide finer-grained guidance across the generation trajectory. This introduces no additional overhead during training data collection and further enhances performance by scaling test-time computation. Experimental results show that our method can effectively improve the performance of both the policy model and the reward model.
Graceful Forgetting in Generative Language Models
Chunyang Jiang | Chi-Min Chan | Yiyang Cai | Yulong Liu | Wei Xue | Yike Guo
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Recently, the pretrain-finetune paradigm has become a cornerstone in various deep learning areas. While a pre-trained model generally improves both the effectiveness and the efficiency of fine-tuning on downstream tasks, studies have shown that not all knowledge acquired during pre-training is beneficial. Some of this knowledge may actually be detrimental to the fine-tuning tasks, a phenomenon known as negative transfer. To address this problem, graceful forgetting has emerged as a promising approach. The core principle of graceful forgetting is to enhance the learning plasticity of the target task by selectively discarding irrelevant knowledge. However, this approach remains underexplored in the context of generative language models, and it is often challenging to migrate existing forgetting algorithms to these models due to architecture incompatibility. To bridge this gap, in this paper we propose a novel framework, Learning With Forgetting (LWF), to achieve graceful forgetting in generative language models. With the Fisher Information Matrix weighting the intended parameter updates, LWF computes forgetting confidence to evaluate self-generated knowledge regarding the forgetting task; consequently, knowledge with high confidence is periodically unlearned during fine-tuning. Our experiments demonstrate that, although thoroughly uncovering the mechanisms of knowledge interaction remains challenging in pre-trained language models, applying graceful forgetting can contribute to enhanced fine-tuning performance.
2023
Exploring the Impact of Model Scaling on Parameter-Efficient Tuning
Yusheng Su | Chi-Min Chan | Jiali Cheng | Yujia Qin | Yankai Lin | Shengding Hu | Zonghan Yang | Ning Ding | Xingzhi Sun | Guotong Xie | Zhiyuan Liu | Maosong Sun
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Parameter-efficient tuning (PET) methods can effectively drive extremely large pre-trained language models (PLMs) by training only minimal parameters. Different PET methods utilize different manually designed tunable modules. In small PLMs, there are usually noticeable performance differences among PET methods. Nevertheless, as the model scale increases, the performance differences become marginal. Hence, we hypothesize that model scaling mitigates the impact of design differences on PET methods. To investigate this hypothesis, we introduce a more flexible PET method called the Arbitrary PET (APET) method. The APET method is compatible with a tunable module that consists of any number of parameters distributed in arbitrary positions. We then use it to conduct experiments on 11 NLP tasks across 3 representative PLMs. Our investigations reveal that model scaling (1) mitigates the effects of the positions of tunable parameters on performance, and (2) enables tuning methods to achieve performance comparable to full-parameter fine-tuning by optimizing fewer tunable parameters. Intriguingly, we also observe that tuning methods need to optimize a similar number of tunable parameters to exceed random-guess performance on different tasks. We collectively discuss this phenomenon and the two aforementioned findings from an optimization perspective to understand the underlying mechanisms. These conclusions enhance our understanding of the impact of model scaling on PET and assist in designing more effective and efficient PET methods for PLMs of different scales. The source code can be obtained from this GitHub repository: https://github.com/yushengsu-thu/PET_Scaling.
Plug-and-Play Document Modules for Pre-trained Models
Chaojun Xiao | Zhengyan Zhang | Xu Han | Chi-Min Chan | Yankai Lin | Zhiyuan Liu | Xiangyang Li | Zhonghua Li | Zhao Cao | Maosong Sun
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large-scale pre-trained models (PTMs) have been widely used in document-oriented NLP tasks, such as question answering. However, the encoding-task coupling requirement results in the repeated encoding of the same documents for different tasks and queries, which is highly computationally inefficient. To this end, we aim to decouple document encoding from downstream tasks, and propose to represent each document as a plug-and-play document module, i.e., a document plugin, for PTMs (PlugD). By inserting document plugins into the backbone PTM for downstream tasks, we can encode a document one time to handle multiple tasks, which is more efficient than conventional encoding-task coupling methods that simultaneously encode documents and input queries using task-specific encoders. Extensive experiments on 8 datasets of 4 typical NLP tasks show that PlugD enables models to encode documents once and for all across different scenarios. In particular, PlugD can save 69% of computational costs while achieving performance comparable to state-of-the-art encoding-task coupling methods. Additionally, we show that PlugD can serve as an effective post-processing method to inject knowledge into task-specific models, improving model performance without any additional model training. Our code and checkpoints can be found at https://github.com/thunlp/Document-Plugin.
2022
On Transferability of Prompt Tuning for Natural Language Processing
Yusheng Su | Xiaozhi Wang | Yujia Qin | Chi-Min Chan | Yankai Lin | Huadong Wang | Kaiyue Wen | Zhiyuan Liu | Peng Li | Juanzi Li | Lei Hou | Maosong Sun | Jie Zhou
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Prompt tuning (PT) is a promising parameter-efficient method to utilize extremely large pre-trained language models (PLMs), which can achieve performance comparable to full-parameter fine-tuning by tuning only a few soft prompts. However, PT requires much more training time than fine-tuning. Intuitively, knowledge transfer can help to improve the efficiency. To explore whether we can improve PT via prompt transfer, we empirically investigate the transferability of soft prompts across different downstream tasks and PLMs in this work. We find that (1) in the zero-shot setting, trained soft prompts can effectively transfer to similar tasks on the same PLM and also to other PLMs with a cross-model projector trained on similar tasks; (2) when used as initialization, trained soft prompts of similar tasks and projected prompts of other PLMs can significantly accelerate training and also improve the performance of PT. Moreover, to explore what decides prompt transferability, we investigate various transferability indicators and find that the overlapping rate of activated neurons strongly reflects the transferability, which suggests that how the prompts stimulate PLMs is essential. Our findings show that prompt transfer is promising for improving PT, and further research should focus more on prompts’ stimulation of PLMs. The source code can be obtained from https://github.com/thunlp/Prompt-Transferability.
Co-authors
- Yike Guo 5
- Sirui Han 4
- Jiaming Ji 4
- Juntao Dai 3
- Chunyang Jiang 3
- Yankai Lin (林衍凯) 3
- Maosong Sun (孙茂松) 3
- Pengcheng Wen 3
- Wei Xue 3
- Zhiyuan Liu 2
- Yujia Qin 2
- Yusheng Su 2
- Yaodong Yang (杨耀东) 2
- Han Zhu 2
- Chengkun Cai 1
- Yiyang Cai 1
- Yi Cao 1
- Zhao Cao 1
- Boyuan Chen (陈博远) 1
- Jiali Cheng 1
- Ning Ding 1
- Xu Han 1
- Donghai Hong 1
- Lei Hou 1
- Shengding Hu 1
- Haoran Li 1
- Juanzi Li 1
- Peng Li 1
- Xiangyang Li 1
- Zhonghua Li 1
- Yulong Liu 1
- Zhiyuan Liu 1
- Jianfeng Qu 1
- Yucheng Shen 1
- Weijie Shi 1
- Xingzhi Sun 1
- Huadong Wang 1
- Xiaozhi Wang 1
- Kaiyue Wen 1
- Chaojun Xiao 1
- Guotong Xie 1
- Chunpu Xu 1
- Jiajie Xu 1
- Wei-Jie Xu 1
- Wei Xue 1
- Zonghan Yang 1
- Zhen Ye 1
- Boqin Yin 1
- Zhengyan Zhang 1
- Jie Zhou 1
- Yujin Zhou 1
- Junqi Zhu 1