Hengli Li
2025
Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective
Yipeng Kang | Junqi Wang | Yexin Li | Mengmeng Wang | Wenming Tu | Quansen Wang | Hengli Li | Tingjun Wu | Xue Feng | Fangwei Zhong | Zilong Zheng
Findings of the Association for Computational Linguistics: ACL 2025
As large language models (LLMs) become increasingly integrated into critical applications, aligning their behavior with human values presents significant challenges. Current methods, such as Reinforcement Learning from Human Feedback (RLHF), typically focus on a limited set of coarse-grained values and are resource-intensive. Moreover, the correlations between these values remain implicit, leading to unclear explanations for value-steering outcomes. Our work argues that a latent causal value graph underlies the value dimensions of LLMs and that, despite alignment training, this structure remains significantly different from human value systems. We leverage these causal value graphs to guide two lightweight value-steering methods: role-based prompting and sparse autoencoder (SAE) steering, effectively mitigating unexpected side effects. Furthermore, SAE steering provides a more fine-grained means of adjusting individual values. Experiments on Gemma-2B-IT and Llama3-8B-IT demonstrate the effectiveness and controllability of our methods.
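As a rough illustration of what SAE-based activation steering can look like in practice (a minimal sketch, not the paper's implementation; `sae_steer`, `sae_encoder`, `sae_decoder`, `feature_idx`, and `strength` are placeholder names), one might amplify a single value-related SAE feature in a layer's activations like this:

```python
import torch

def sae_steer(hidden, sae_encoder, sae_decoder, feature_idx, strength=2.0):
    """Inference-time intervention on one SAE feature (illustrative sketch).

    hidden: [batch, d_model] activations from a chosen transformer layer.
    sae_encoder / sae_decoder: linear maps of a trained sparse autoencoder.
    feature_idx: index of the value-related SAE feature to boost.
    """
    feats = torch.relu(sae_encoder(hidden))   # sparse feature activations
    feats[:, feature_idx] += strength         # boost the target value feature
    return sae_decoder(feats)                 # map back to the residual stream

# Toy usage with random weights, purely to show the shapes involved.
d_model, d_sae = 16, 64
enc = torch.nn.Linear(d_model, d_sae)
dec = torch.nn.Linear(d_sae, d_model)
h = torch.randn(2, d_model)
steered = sae_steer(h, enc, dec, feature_idx=3)
```

Because the intervention targets one named feature rather than a whole prompt or reward signal, it is finer-grained than role-based prompting, which is the contrast the abstract draws.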
2024
MindDial: Enhancing Conversational Agents with Theory-of-Mind for Common Ground Alignment and Negotiation
Shuwen Qiu | Mingdian Liu | Hengli Li | Song-Chun Zhu | Zilong Zheng
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Humans talk in daily conversations while aligning and negotiating the expressed meanings, or common ground. Despite the impressive conversational abilities of large generative language models, they do not account for individual differences in contextual understanding within a shared situated environment. In this work, we propose MindDial, a novel conversational framework that can generate situated free-form responses to align and negotiate common ground. We design an explicit mind module that can track three-level beliefs – the speaker’s belief, the speaker’s prediction of the listener’s belief, and the belief gap between the first two. The next response is then generated to resolve the belief difference and take task-related action. Our framework is applied to both prompting-based and fine-tuning-based models, and is evaluated across scenarios involving both common ground alignment and negotiation. Experiments show that models with mind modeling can generate more human-like responses when aligning and negotiating common ground. The ablation study further validates that the three-level belief design can aggregate information and improve task outcomes in both cooperative and negotiating settings.
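To make the three-level belief design concrete, here is a minimal, hypothetical sketch (not the MindDial implementation; `MindState`, `first_order`, `second_order`, and `belief_gap` are names chosen for illustration) of how a speaker's own belief, the predicted listener belief, and the gap between them could be represented:

```python
from dataclasses import dataclass, field

@dataclass
class MindState:
    """Three-level belief state for one speaker (illustrative sketch).

    first_order: what the speaker believes about the situation.
    second_order: what the speaker predicts the listener believes.
    The belief gap (third level) is derived from the first two.
    """
    first_order: dict = field(default_factory=dict)
    second_order: dict = field(default_factory=dict)

    def belief_gap(self):
        """Facts whose value differs between the two belief levels."""
        return {k: (v, self.second_order.get(k))
                for k, v in self.first_order.items()
                if self.second_order.get(k) != v}

# Example: the speaker thinks the meeting is at 3pm but predicts the
# listener believes it is at 2pm; the next utterance should resolve this.
state = MindState(first_order={"meeting_time": "3pm"},
                  second_order={"meeting_time": "2pm"})
print(state.belief_gap())   # {'meeting_time': ('3pm', '2pm')}
```

In the framework described by the abstract, a non-empty gap of this kind is what drives the next response, whether the model is prompted or fine-tuned.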