Shitong Duan
2025
Value Compass Benchmarks: A Comprehensive, Generative and Self-Evolving Platform for LLMs’ Value Evaluation
Jing Yao
|
Xiaoyuan Yi
|
Shitong Duan
|
Jindong Wang
|
Yuzhuo Bai
|
Muhua Huang
|
Yang Ou
|
Scarlett Li
|
Peng Zhang
|
Tun Lu
|
Zhicheng Dou
|
Maosong Sun
|
James Evans
|
Xing Xie
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
As large language models (LLMs) are gradually integrated into human daily life, assessing their underlying values becomes essential for understanding their risks and alignment with specific preferences. Despite growing efforts, current value evaluation methods face two key challenges. C1. Evaluation Validity: Static benchmarks fail to reflect intended values or yield informative results due to data contamination or a ceiling effect. C2. Result Interpretation: They typically reduce the pluralistic and often incommensurable values to one-dimensional scores, which hinders users from gaining meaningful insights and guidance. To address these challenges, we present Value Compass Benchmarks, the first dynamic, online and interactive platform specially devised for comprehensive value diagnosis of LLMs. It (1) grounds evaluations in multiple basic value systems from social science; (2) develops a generative evolving evaluation paradigm that automatically creates real-world test items co-evolving with ever-advancing LLMs; (3) offers multi-faceted result interpretation, including (i) fine-grained scores and case studies across 27 value dimensions for 33 leading LLMs, (ii) customized comparisons, and (iii) visualized analysis of LLMs’ alignment with cultural values. We hope Value Compass Benchmarks serves as a navigator for further enhancing LLMs’ safety and alignment, benefiting their responsible and adaptive development.
2024
Negating Negatives: Alignment with Human Negative Samples via Distributional Dispreference Optimization
Shitong Duan
|
Xiaoyuan Yi
|
Peng Zhang
|
Yan Liu
|
Zheng Liu
|
Tun Lu
|
Xing Xie
|
Ning Gu
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) have revolutionized the role of AI, yet pose potential social risks. To steer LLMs towards human preference, alignment technologies have been introduced and gained increasing attention. Nevertheless, existing methods heavily rely on high-quality positive-negative training pairs, suffering from noisy positive responses that are barely distinguishable from negative ones. Given recent LLMs’ proficiency in generating helpful responses, this work pivots towards a new research question: **can we achieve alignment using solely human-annotated negative samples, preserving helpfulness while reducing harmfulness?** For this purpose, we propose Distributional Dispreference Optimization (D2O), which maximizes the discrepancy between dispreferred responses and the generated non-negative ones. In this way, D2O effectively eschews harmful information without incorporating noisy positive samples, while avoiding collapse using self-generated responses as anchors. We demonstrate that D2O can be regarded as learning a distributional preference model reflecting human dispreference against negative responses, which is theoretically an upper bound of the instance-level DPO. Extensive experiments manifest that our method achieves comparable generation quality and surpasses the latest strong baselines in producing less harmful and more informative responses with better training stability and faster convergence.