Leila Khalatbari
2025
High-Dimension Human Value Representation in Large Language Models
Samuel Cahyawijaya | Delong Chen | Yejin Bang | Leila Khalatbari | Bryan Wilie | Ziwei Ji | Etsuko Ishii | Pascale Fung
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The widespread application of Large Language Models (LLMs) across various tasks and fields has necessitated the alignment of these models with human values and preferences. Given the various approaches to human value alignment, such as Reinforcement Learning from Human Feedback (RLHF), constitutional learning, and safety fine-tuning, there is an urgent need to understand the scope and nature of the human values injected into these LLMs before their deployment and adoption. We propose UniVar, a high-dimensional neural representation of symbolic human value distributions in LLMs, orthogonal to model architecture and training data. This is a continuous and scalable representation, self-supervised from the value-relevant output of 8 LLMs and evaluated on 15 open-source and commercial LLMs. Through UniVar, we visualize and explore how LLMs prioritize different values in 25 languages and cultures, shedding light on the complex interplay between human values and language modeling.
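The abstract describes the pipeline only at a high level; the sketch below is a hypothetical illustration of the kind of value-space visualization it mentions, not the UniVar method itself. It assumes a sentence_transformers encoder, a umap-learn projection, and a made-up value_qa record layout holding value-eliciting question-answer pairs collected from LLMs.

```python
# Hypothetical sketch (not the UniVar method): embed value-relevant LLM
# answers into a shared space and project them to 2-D for visualization.
# The encoder, the value_qa record layout, and the UMAP projection are
# assumptions made for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer
import umap


def embed_value_answers(value_qa, encoder_name="all-MiniLM-L6-v2"):
    # value_qa: list of dicts such as
    # {"llm": "model-A", "lang": "en", "question": "...", "answer": "..."}
    encoder = SentenceTransformer(encoder_name)
    texts = [f"Q: {x['question']} A: {x['answer']}" for x in value_qa]
    return encoder.encode(texts, normalize_embeddings=True)


def project_2d(embeddings, n_neighbors=15, min_dist=0.1):
    # Reduce the high-dimensional embeddings to 2-D; points can then be
    # colored by the "llm" or "lang" field to compare value priorities.
    reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist, metric="cosine")
    return reducer.fit_transform(np.asarray(embeddings))
```

Coloring the projected points by model or language gives a rough analogue of the cross-lingual value maps the abstract describes, though the paper's learned representation is trained with its own self-supervised objective rather than an off-the-shelf encoder.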
2024
Flatness-Aware Gradient Descent for Safe Conversational AI
Leila Khalatbari | Saeid Hosseini | Hossein Sameti | Pascale Fung
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
As generative dialog models become ubiquitous in real-world applications, it is paramount to ensure harmless generation. There are two major challenges when enforcing safety in open-domain chatbots. First, it is impractical to provide training data reflecting the desired response to every emerging form of toxicity (the generalisation challenge). Second, implementing safety features may compromise the quality of the conversation (the trade-off challenge). To tackle these challenges, this paper introduces a regularized fine-tuning approach called FlatGD. By employing a safety-tailored loss, we translate better optimization into improved safety. To ensure better optimization, FlatGD penalizes sharp trajectories of the loss curve, encouraging flatness of the converged local minima. Experimental results on the BAD and ProsocialDialog datasets demonstrate that our model outperforms current baselines in reducing toxicity while preserving conversation quality. Moreover, compared to other baselines, FlatGD generalizes better to unseen toxic data.
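The abstract does not spell out how flatness is enforced; the sketch below is a SAM-style (sharpness-aware minimization) approximation of the general idea of penalizing sharp minima during fine-tuning, not the paper's FlatGD algorithm. It assumes a PyTorch model, a loss_fn(model, batch) callable returning a scalar loss, and a perturbation radius rho.

```python
# SAM-style sketch of a flatness-encouraging update (an illustration of the
# general idea, not the FlatGD algorithm from the paper). Assumes a PyTorch
# model and a loss_fn(model, batch) -> scalar loss.
import torch


def flatness_aware_step(model, loss_fn, batch, optimizer, rho=0.05):
    # First pass: gradient of the loss at the current weights.
    loss = loss_fn(model, batch)
    loss.backward()

    # Ascend to the approximate worst-case point inside an L2 ball of radius rho.
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
        eps = []
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                      # w <- w + e
            eps.append((p, e))
    optimizer.zero_grad()

    # Second pass: the gradient at the perturbed weights drives the real update,
    # which favors minima whose neighborhood also has low loss (flat minima).
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, e in eps:
            p.sub_(e)                      # restore the original weights
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

One design note on this family of methods: the two forward-backward passes roughly double the per-step cost, which is the usual price paid for sharpness-aware training.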
Co-authors (number of joint papers)
- Pascale Fung 2
- Yejin Bang 1
- Samuel Cahyawijaya 1
- Delong Chen 1
- Saeid Hosseini 1