Yaodong Yang


2025

Language Models Resist Alignment: Evidence From Data Compression
Jiaming Ji | Kaile Wang | Tianyi Alex Qiu | Boyuan Chen | Jiayi Zhou | Changye Li | Hantao Lou | Josef Dai | Yunhuai Liu | Yaodong Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment.
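
As a toy illustration of the compression-theoretic lens, one can treat a causal language model's average per-token negative log-likelihood as its idealized compression rate on a text sample and track how that rate shifts between pre-training-style and alignment-style text as the model is further fine-tuned. The sketch below measures only this proxy quantity; the checkpoint name and sample strings are placeholders, not the paper's experimental setup.

```python
# Minimal sketch: per-token negative log-likelihood as a compression-rate proxy.
# "gpt2" and the sample texts are stand-ins, not the paper's actual setup.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder checkpoint; swap in the model under study

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def bits_per_token(text: str) -> float:
    """Average bits per token under the model (its ideal code length)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # The returned loss is mean cross-entropy in nats; convert to bits.
        loss = model(ids, labels=ids).loss.item()
    return loss / math.log(2)


# Compare compression on pre-training-style text vs. alignment-style text,
# before and after any further fine-tuning of interest.
pretrain_like = "The theory of general relativity describes gravitation as ..."
alignment_like = "I'm sorry, but I can't help with that request."

print("pre-training-style:", bits_per_token(pretrain_like))
print("alignment-style:  ", bits_per_token(alignment_like))
```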

PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
Jiaming Ji | Donghai Hong | Borong Zhang | Boyuan Chen | Josef Dai | Boren Zheng | Tianyi Alex Qiu | Jiayi Zhou | Kaile Wang | Boxun Li | Sirui Han | Yike Guo | Yaodong Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate annotations of helpfulness and harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Based on this, we collected 166.8k preference annotations, comprising dual-preference data (with helpfulness and harmlessness decoupled) and single-preference data (with helpfulness and harmlessness traded off jointly). Using this large-scale annotated data, we further train severity-sensitive moderation models for the risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs.
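
For readers who want to inspect the data directly, the release is distributed through Hugging Face; a minimal sketch follows, assuming the dataset identifier PKU-Alignment/PKU-SafeRLHF and the field names shown (check the dataset card for the authoritative schema).

```python
# Hedged sketch: browsing PKU-SafeRLHF dual-preference pairs.
# The dataset id and field names below are assumptions drawn from the release
# description; verify them against the dataset card before relying on them.
from datasets import load_dataset

ds = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")

example = ds[0]
prompt = example["prompt"]                                   # the question
responses = [example["response_0"], example["response_1"]]   # two candidate answers
better = example["better_response_id"]                       # helpfulness preference (0 or 1)
safer = example["safer_response_id"]                         # harmlessness preference (0 or 1)

print(prompt)
print("preferred for helpfulness: ", responses[better])
print("preferred for harmlessness:", responses[safer])
```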

Boosting Policy and Process Reward Models with Monte Carlo Tree Search in Open-Domain QA
Chi-Min Chan | Chunpu Xu | Junqi Zhu | Jiaming Ji | Donghai Hong | Pengcheng Wen | Chunyang Jiang | Zhen Ye | Yaodong Yang | Wei Xue | Sirui Han | Yike Guo
Findings of the Association for Computational Linguistics: ACL 2025

The recent introduction of OpenAI’s O1/O3 model represents a significant milestone in developing strong reasoning capabilities in Large Language Models (LLMs). By introducing more computational budget during test-time, LLMs have the potential to explore more accurate and higher-quality solutions. However, such paradigms are primarily verified in domains that have well-defined criteria for responses, such as coding and mathematics. Inspired by the success of this paradigm, we aim to extend it to the subtler setting of open-domain question answering. Specifically, we utilize search mechanisms such as Monte Carlo Tree Search (MCTS) for both policy model improvement and reward model improvement, achieving better performance under test-time scaling strategies. Our contributions are twofold: For the training phase, we demonstrate that our approach surpasses previous SOTA automatic data annotation methods and various public instruction-tuning datasets, with fewer data points. This offers a more data-efficient solution for training robust models. For the inference phase, we utilize the intermediate values collected during training data construction to train a process reward model called PRM+. This model employs a novel two-stage training method to provide finer-grained guidance across the generation trajectory. This introduces no additional overhead during training data collection and further enhances performance by scaling test-time computation. Experimental results show that our method can effectively improve the performance of both the policy model and the reward model.
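
The search component follows standard Monte Carlo Tree Search; for reference, the sketch below implements plain UCT selection, expansion, and backpropagation over generic callables. It is a generic MCTS illustration rather than the paper's training pipeline: `expand_fn` and `score_fn` are hypothetical stand-ins for the policy model proposing next reasoning steps and the (process) reward model scoring them.

```python
# Generic MCTS/UCT sketch. `expand_fn` proposes child states for a state and
# `score_fn` returns a scalar reward for a leaf; both are hypothetical
# stand-ins for the policy model and the (process) reward model.
import math
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    state: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0


def uct_select(node: Node, c: float = 1.4) -> Node:
    """Pick the child maximizing the UCT upper-confidence bound."""
    return max(
        node.children,
        key=lambda ch: ch.value / (ch.visits + 1e-9)
        + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)),
    )


def search(root_state: str, expand_fn, score_fn, n_sims: int = 100) -> Node:
    root = Node(root_state)
    for _ in range(n_sims):
        node = root
        # Selection: descend through already-expanded nodes.
        while node.children:
            node = uct_select(node)
        # Expansion: add one layer of children proposed by the policy.
        for nxt in expand_fn(node.state):
            node.children.append(Node(nxt, parent=node))
        leaf = random.choice(node.children) if node.children else node
        reward = score_fn(leaf.state)
        # Backpropagation: update visit counts and values up to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    # The most-visited child is taken as the chosen next step.
    return max(root.children, key=lambda ch: ch.visits)
```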

SafeLawBench: Towards Safe Alignment of Large Language Models
Chuxue Cao | Han Zhu | Jiaming Ji | Qichao Sun | Zhenghao Zhu | Wu Yinyu | Josef Dai | Yaodong Yang | Sirui Han | Yike Guo
Findings of the Association for Computational Linguistics: ACL 2025

With the growing prevalence of large language models (LLMs), the safety of LLMs has raised significant concerns. However, there is still a lack of definitive standards for evaluating their safety due to the subjective nature of current safety benchmarks. To address this gap, we conducted the first exploration of LLMs’ safety evaluation from a legal perspective by proposing the SafeLawBench benchmark. SafeLawBench categorizes safety risks into three levels based on legal standards, providing a systematic and comprehensive framework for evaluation. It comprises 24,860 multiple-choice questions and 1,106 open-domain question-answering (QA) tasks. Our evaluation included 2 closed-source LLMs and 18 open-source LLMs using zero-shot and few-shot prompting, highlighting the safety features of each model. We also evaluated the LLMs’ safety-related reasoning stability and refusal behavior. Additionally, we found that a majority voting mechanism can enhance model performance. Notably, even leading SOTA models like Claude-3.5-Sonnet and GPT-4o have not exceeded 80.5% accuracy in multiple-choice tasks on SafeLawBench, while the average accuracy of all 20 LLMs remains at 68.8%. We urge the community to prioritize research on the safety of LLMs.
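
The majority-voting observation is mechanically simple to reproduce: sample several answers per question and keep the most frequent option. A minimal sketch follows, assuming responses have already been reduced to option letters; the sampler below is a hypothetical stand-in for an LLM call.

```python
# Minimal sketch of majority voting over sampled multiple-choice answers.
# `sample_answer` is a hypothetical stand-in for a single LLM call that
# returns an option letter such as "A", "B", "C", or "D".
import random
from collections import Counter


def majority_vote(question: str, sample_answer, k: int = 5) -> str:
    """Query the sampler k times and return the most common option letter."""
    votes = [sample_answer(question) for _ in range(k)]
    letter, _count = Counter(votes).most_common(1)[0]
    return letter


def dummy_sampler(_question: str) -> str:
    """Noisy stand-in sampler, biased toward option A."""
    return random.choices("ABCD", weights=[0.6, 0.2, 0.1, 0.1])[0]


print(majority_vote("Which of the following actions is lawful?", dummy_sampler))
```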

Reward Generalization in RLHF: A Topological Perspective
Tianyi Alex Qiu | Fanzhi Zeng | Jiaming Ji | Dong Yan | Kaile Wang | Jiayi Zhou | Yang Han | Josef Dai | Xuehai Pan | Yaodong Yang
Findings of the Association for Computational Linguistics: ACL 2025

Existing alignment methods share a common topology of information flow, where reward information is collected from humans, modeled with preference learning, and used to tune language models. However, this shared topology has not been systematically characterized, nor have its alternatives been thoroughly explored, leaving the problems of low data efficiency and unreliable generalization unaddressed. As a solution, we introduce a theory of **reward generalization** in reinforcement learning from human feedback (RLHF), focusing on the **topology of information flow** at both macro and micro levels. At the macro level, we portray the RLHF information flow as an autoencoding process over behavior distributions, formalizing the RLHF objective of distributional consistency between human preference and model behavior. At the micro level, we present *induced Bayesian networks* to model the impact of dataset topologies on reward generalization. Combining analyses at both levels, we propose **reward modeling from tree-structured preference information**. It is shown to reduce reward uncertainty by up to Θ(log n / log log n) times compared to baselines, where n is the dataset size. Validation on three NLP tasks shows that it achieves an average win rate of 65% against baselines, thus improving reward generalization *for free* via topology design, while *reducing* the amount of data requiring annotation.
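
The micro-level argument is about how the comparison graph over sampled responses is wired. The toy sketch below only contrasts the two dataset topologies, a chain that compares each response with the next versus a balanced binary tree that compares each response with its parent; it does not reproduce the induced-Bayesian-network analysis or the reward-model training itself.

```python
# Toy sketch: pairwise-comparison edges induced by two dataset topologies over
# the same n responses. Response ids are abstract integers here; in practice
# each id would index a sampled model response to one prompt.

def chain_comparisons(n: int) -> list[tuple[int, int]]:
    """Chain topology: compare each response with the next one."""
    return [(i, i + 1) for i in range(n - 1)]


def tree_comparisons(n: int) -> list[tuple[int, int]]:
    """Balanced binary-tree topology: compare each response with its parent."""
    return [((i - 1) // 2, i) for i in range(1, n)]


if __name__ == "__main__":
    n = 7
    print("chain edges:", chain_comparisons(n))
    print("tree edges: ", tree_comparisons(n))
    # Both topologies use n - 1 comparisons, but the tree keeps every response
    # within O(log n) comparisons of the root, whereas the chain's diameter
    # grows linearly in n.
```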

Benchmarking Multi-National Value Alignment for Large Language Models
Chengyi Ju | Weijie Shi | Chengzhong Liu | Jiaming Ji | Jipeng Zhang | Ruiyuan Zhang | Jiajie Xu | Yaodong Yang | Sirui Han | Yike Guo
Findings of the Association for Computational Linguistics: ACL 2025

Do Large Language Models (LLMs) hold positions that conflict with your country’s values? Occasionally they do! However, existing works primarily focus on ethical reviews, failing to capture the diversity of national values, which encompass broader policy, legal, and moral considerations. Furthermore, current benchmarks that rely on spectrum tests using manually designed questionnaires are not easily scalable. To address these limitations, we introduce NaVAB, a comprehensive benchmark to evaluate the alignment of LLMs with the values of five major nations: China, the United States, the United Kingdom, France, and Germany. NaVAB implements a national value extraction pipeline to efficiently construct value assessment datasets. Specifically, we propose a modeling procedure with instruction tagging to process raw data sources, a screening process to filter value-related topics, and a generation process with a Conflict Reduction mechanism to filter non-conflicting values. We conduct extensive experiments on various LLMs across countries, and the results provide insights that assist in identifying misaligned scenarios. Moreover, we demonstrate that NaVAB can be combined with alignment techniques to effectively reduce value concerns by aligning LLMs’ values with those of the target country.

Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability
Haonan Li | Xudong Han | Zenan Zhai | Honglin Mu | Hao Wang | Zhenxuan Zhang | Yilin Geng | Shom Lin | Renxi Wang | Artem Shelmanov | Xiangyu Qi | Yuxia Wang | Donghai Hong | Youliang Yuan | Meng Chen | Haoqin Tu | Fajri Koto | Cong Zeng | Tatsuki Kuribayashi | Rishabh Bhardwaj | Bingchen Zhao | Yawen Duan | Yi Liu | Emad A. Alghamdi | Yaodong Yang | Yinpeng Dong | Soujanya Poria | Pengfei Liu | Zhengzhong Liu | Hector Xuguang Ren | Eduard Hovy | Iryna Gurevych | Preslav Nakov | Monojit Choudhury | Timothy Baldwin
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

As large language models (LLMs) continue to evolve, leaderboards play a significant role in steering their development. Existing leaderboards often prioritize model capabilities while overlooking safety concerns, leaving a significant gap in responsible AI development. To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate the overall rankings. This approach incentivizes models to achieve a balance rather than excel in one dimension at the expense of the other. In the first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models.
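
The distance-to-optimal-score idea can be illustrated briefly: instead of averaging the two scores, rank models by how far they fall from the ideal point at which both capability and safety are perfect, so a lopsided model ranks below a balanced one with the same average. The normalization and exact formula used by Libra-Leaderboard may differ; the sketch below is a hedged illustration with made-up numbers, not its scoring code.

```python
# Hedged sketch of distance-to-optimal-score ranking: each model is scored by
# its distance to the ideal point (100, 100) in (capability, safety) space,
# so imbalance is penalized even when the simple average is identical.
# The scores and the exact formula are illustrative, not Libra-Leaderboard's.
import math

models = {
    "balanced-model": (80.0, 80.0),   # (capability, safety), both on a 0-100 scale
    "lopsided-model": (98.0, 62.0),   # same simple average, worse balance
}


def distance_to_optimal(capability: float, safety: float) -> float:
    """Euclidean distance to the perfect score (100, 100); lower is better."""
    return math.hypot(100.0 - capability, 100.0 - safety)


for name, (cap, safe) in sorted(models.items(),
                                key=lambda kv: distance_to_optimal(*kv[1])):
    avg = (cap + safe) / 2
    print(f"{name}: average={avg:.1f}, "
          f"distance-to-optimal={distance_to_optimal(cap, safe):.1f}")
```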