Yuyan Bu


2025

Ethical decision-making is a critical aspect of human judgment, and the growing use of large language models (LLMs) in decision-support systems necessitates a rigorous evaluation of their moral reasoning capabilities. However, existing assessments rely primarily on single-step evaluations, which fail to capture how models adapt to evolving ethical challenges. To address this gap, we introduce Multi-step Moral Dilemmas (MMDs), the first dataset specifically constructed to evaluate the evolving moral judgments of LLMs across 3,302 five-stage dilemmas. This framework enables a fine-grained, dynamic analysis of how LLMs adjust their moral reasoning across escalating dilemmas. Our evaluation of nine widely used LLMs reveals that their value preferences shift significantly as dilemmas progress, indicating that models recalibrate moral judgments based on scenario complexity. Furthermore, pairwise value comparisons demonstrate that while LLMs often prioritize the value of care, it is sometimes superseded by fairness, highlighting the dynamic and context-dependent nature of LLM ethical reasoning. Our findings call for a shift toward dynamic, context-aware evaluation paradigms, paving the way for more human-aligned and value-sensitive development of LLMs.
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models (LLMs) with human values. However, reward models in RLHF often exhibit unintended biases, such as an overemphasis on response length that rests on the erroneous assumption that longer responses are universally preferred. This “length bias” can lead to excessively verbose responses that compromise the quality of LLM alignment. Previous efforts to mitigate length bias in reward models have inadvertently decreased their accuracy by neglecting the legitimate influence of response length on human preferences. In this work, we argue that response length is a context-specific factor in human evaluations, with different queries naturally eliciting different preferences for response length. We propose an adaptive approach to modeling length preference that dynamically adjusts the influence of response length in reward evaluations according to the context of the query. Experimental results demonstrate that our adaptive approach effectively balances the mitigation of undesired length hacking against alignment accuracy, reducing unnecessary verbosity while improving overall response quality.
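
As a rough illustration of what a query-dependent length influence could look like, the sketch below adds a length term to a length-independent content score, scaled by a weight that varies per query. This is not the formulation from the work summarized above, which the abstract does not specify. The function names, the saturating length feature, the scale of 200 tokens, and the example weights are all assumptions made for illustration.

```python
import math


def length_term(num_tokens: int, scale: float = 200.0) -> float:
    """Saturating length feature in [0, 1): grows with length, then flattens.

    The exponential form and the default scale are illustrative assumptions.
    """
    return 1.0 - math.exp(-num_tokens / scale)


def adaptive_reward(content_score: float, num_tokens: int, length_weight: float) -> float:
    """Combine a content score with a length term weighted per query.

    length_weight is a hypothetical query-dependent value: negative when the
    query calls for a brief answer, positive when it calls for detail.
    """
    return content_score + length_weight * length_term(num_tokens)


# Hypothetical weights; a real system would predict these from the query.
brief_query_weight = -0.5     # e.g. a simple factual question
detailed_query_weight = 0.5   # e.g. a request for an in-depth explanation

short_len, long_len = 20, 400
content = 1.0  # same content quality for both responses

brief_prefers_short = (
    adaptive_reward(content, short_len, brief_query_weight)
    > adaptive_reward(content, long_len, brief_query_weight)
)
detailed_prefers_long = (
    adaptive_reward(content, long_len, detailed_query_weight)
    > adaptive_reward(content, short_len, detailed_query_weight)
)
```

With content quality held equal, the same pair of responses is ranked in opposite orders under the two weights, which is the behavior a fixed length penalty or a fixed length bonus cannot reproduce.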