2025
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
Jiaming Ji | Donghai Hong | Borong Zhang | Boyuan Chen | Josef Dai | Boren Zheng | Tianyi Alex Qiu | Jiayi Zhou | Kaile Wang | Boxun Li | Sirui Han | Yike Guo | Yaodong Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate the annotations of helpfulness and harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels covering 19 harm categories and three severity levels (minor to severe), with answers generated by Llama-family models. Building on this, we collected 166.8k preference annotations, comprising dual-preference data (helpfulness and harmlessness decoupled) and single-preference data (helpfulness and harmlessness traded off jointly). Using this large-scale annotated data, we further train severity-sensitive moderation models for LLM risk control and safety-centric RLHF algorithms for LLM safety alignment. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs.
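To make the dual-preference structure described in the abstract concrete, the sketch below shows how one such record might be represented and split into separate helpfulness and harmlessness training pairs. The field names and the toy record are illustrative assumptions, not taken from the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class DualPreferenceRecord:
    """One dual-preference record: the same response pair is annotated
    separately for helpfulness and for harmlessness (safety)."""
    prompt: str
    response_0: str
    response_1: str
    better_response_id: int   # helpfulness preference: 0 or 1 (assumed field name)
    safer_response_id: int    # harmlessness preference: 0 or 1 (assumed field name)
    is_response_0_safe: bool  # per-response safety meta-label
    is_response_1_safe: bool
    severity_level: str       # e.g. "minor" / "moderate" / "severe"

def to_training_pairs(rec: DualPreferenceRecord) -> dict:
    """Split one record into two (chosen, rejected) pairs, one per axis,
    so separate helpfulness and harmlessness reward models can be trained."""
    responses = (rec.response_0, rec.response_1)
    helpful_pair = (responses[rec.better_response_id],
                    responses[1 - rec.better_response_id])
    harmless_pair = (responses[rec.safer_response_id],
                     responses[1 - rec.safer_response_id])
    return {"helpfulness": helpful_pair, "harmlessness": harmless_pair}

# Toy usage example (hypothetical content)
rec = DualPreferenceRecord(
    prompt="How do I dispose of old batteries?",
    response_0="Throw them in the trash.",
    response_1="Take them to a certified recycling point.",
    better_response_id=1, safer_response_id=1,
    is_response_0_safe=False, is_response_1_safe=True,
    severity_level="minor",
)
print(to_training_pairs(rec))
```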
Boosting Policy and Process Reward Models with Monte Carlo Tree Search in Open-Domain QA
Chi-Min Chan | Chunpu Xu | Junqi Zhu | Jiaming Ji | Donghai Hong | Pengcheng Wen | Chunyang Jiang | Zhen Ye | Yaodong Yang | Wei Xue | Sirui Han | Yike Guo
Findings of the Association for Computational Linguistics: ACL 2025
The recent introduction of OpenAI’s O1/O3 models represents a significant milestone in developing strong reasoning capabilities in Large Language Models (LLMs). By spending more computational budget at test time, LLMs have the potential to explore more accurate and higher-quality solutions. However, such paradigms have primarily been verified in domains with well-defined response criteria, such as coding and mathematics. Inspired by the success of this paradigm, we aim to bring it to the more nuanced setting of open-domain question answering. Specifically, we utilize search mechanisms such as Monte Carlo Tree Search (MCTS) to improve both the policy model and the reward model, achieving better performance under test-time scaling strategies. Our contributions are twofold. For the training phase, we demonstrate that our approach surpasses previous SOTA automatic data annotation methods and various public instruction-tuning datasets while using fewer data points, offering a more data-efficient solution for training robust models. For the inference phase, we use the intermediate values collected during training data construction to train a process reward model called PRM+. This model employs a novel two-stage training method to provide finer-grained guidance across the generation trajectory. It introduces no additional overhead during training data collection and further enhances performance by scaling test-time computation. Experimental results show that our method effectively improves the performance of both the policy model and the reward model.
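The abstract's core mechanism is reward-model-guided tree search over partial answers. The sketch below is a generic MCTS loop of that flavor, with stand-in functions for step proposal and trajectory scoring; it does not reproduce the paper's actual expansion policy or the PRM+ two-stage training.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                      # partial answer (sequence of reasoning steps)
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    def uct(self, c: float = 1.4) -> float:
        # Standard UCT: exploitation term plus exploration bonus.
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def propose_steps(state: str) -> list[str]:
    """Stand-in for sampling candidate next steps from the policy LLM."""
    return [state + f" step{random.randint(0, 9)};" for _ in range(3)]

def score_state(state: str) -> float:
    """Stand-in for a (process) reward model scoring a partial trajectory."""
    return random.random()

def mcts(root_state: str, iterations: int = 100, depth: int = 4) -> str:
    root = Node(root_state)
    for _ in range(iterations):
        # 1) Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # 2) Expansion: add candidate continuations unless at max depth.
        if node.state.count(";") < depth:
            node.children = [Node(s, parent=node) for s in propose_steps(node.state)]
            node = random.choice(node.children)
        # 3) Evaluation: score the leaf with the reward model.
        value = score_state(node.state)
        # 4) Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # Return the most-visited first step as the preferred continuation.
    best = max(root.children, key=lambda n: n.visits)
    return best.state

print(mcts("Q: Why is the sky blue? A:"))
```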
Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability
Haonan Li | Xudong Han | Zenan Zhai | Honglin Mu | Hao Wang | Zhenxuan Zhang | Yilin Geng | Shom Lin | Renxi Wang | Artem Shelmanov | Xiangyu Qi | Yuxia Wang | Donghai Hong | Youliang Yuan | Meng Chen | Haoqin Tu | Fajri Koto | Cong Zeng | Tatsuki Kuribayashi | Rishabh Bhardwaj | Bingchen Zhao | Yawen Duan | Yi Liu | Emad A. Alghamdi | Yaodong Yang | Yinpeng Dong | Soujanya Poria | Pengfei Liu | Zhengzhong Liu | Hector Xuguang Ren | Eduard Hovy | Iryna Gurevych | Preslav Nakov | Monojit Choudhury | Timothy Baldwin
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
As large language models (LLMs) continue to evolve, leaderboards play a significant role in steering their development. Existing leaderboards often prioritize model capabilities while overlooking safety concerns, leaving a notable gap in responsible AI development. To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate the overall rankings. This approach incentivizes models to strike a balance rather than excel in one dimension at the expense of the other. In its first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models.
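The distance-to-optimal-score idea can be made concrete in a few lines. The sketch below contrasts it with simple averaging for two models that share the same mean score; the exact normalization and weighting used by Libra-Leaderboard are assumptions here, not taken from the abstract.

```python
import math

def average_score(capability: float, safety: float) -> float:
    """Naive baseline: simple mean of two scores already normalized to [0, 1]."""
    return (capability + safety) / 2

def distance_to_optimal_score(capability: float, safety: float) -> float:
    """Distance-to-optimal ranking score, assuming both axes lie in [0, 1] with
    the optimum at (1, 1); the exact normalization may differ in practice."""
    dist = math.hypot(1.0 - capability, 1.0 - safety)
    return 1.0 - dist / math.sqrt(2)  # rescale so the score also lies in [0, 1]

# A balanced model vs. a lopsided one with the same average:
print(average_score(0.7, 0.7), distance_to_optimal_score(0.7, 0.7))  # 0.70, 0.70
print(average_score(1.0, 0.4), distance_to_optimal_score(1.0, 0.4))  # 0.70, ~0.58
```

The lopsided model is penalized even though its average matches the balanced one, which is the incentive the leaderboard's ranking method is meant to create.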