Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values

Hongbo Zhang; Han Cui; Guangsheng Bao; Linyi Yang; Jun Wang (王军); Yue Zhang (张岳, 章岳)

Abstract

We introduce Direct Value Optimization (DVO), an offline reinforcement learning framework for enhancing large language models on complex reasoning tasks. Unlike traditional methods that rely on preference labels, DVO uses value signals at individual reasoning steps and optimizes the model with a mean squared error loss. The key benefit of DVO is its fine-grained supervision, which avoids the need for labor-intensive human annotation. Target values in DVO are estimated using either Monte Carlo Tree Search or an outcome value model. Our empirical analysis on 3 math reasoning, 4 commonsense reasoning, and 3 coding tasks shows that DVO consistently outperforms existing offline preference optimization techniques by a margin of 4% to 6%, and is competitive with online GRPO while being more sample-efficient. These findings underscore the importance of value signals in advancing reasoning capabilities and highlight DVO as a strong choice in scenarios that lack explicit human preference information.
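The abstract does not spell out the exact objective, so the following is only a minimal sketch of the general idea of regressing step-level values onto estimated targets with a mean squared error loss. It assumes plain Monte Carlo rollouts as a simplified stand-in for Monte Carlo Tree Search, and assumes the model's per-step value predictions are already available as numbers. The function names `monte_carlo_step_values` and `mse_loss`, and all the sample data, are hypothetical and not taken from the paper.

```python
from statistics import fmean


def monte_carlo_step_values(rollout_outcomes):
    """Estimate a target value for each reasoning step.

    rollout_outcomes[t] holds the 0/1 correctness of several rollouts
    continued from step t (assumption: plain rollouts, not full MCTS).
    The target value is the fraction of rollouts that ended correctly.
    """
    return [fmean(outcomes) for outcomes in rollout_outcomes]


def mse_loss(predicted, targets):
    """Mean squared error between predicted and target step values."""
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    return fmean((p - t) ** 2 for p, t in zip(predicted, targets))


# Hypothetical data: three reasoning steps, four rollouts from each.
rollouts = [[1, 0, 1, 1], [1, 1, 1, 1], [0, 0, 1, 0]]
targets = monte_carlo_step_values(rollouts)

# Hypothetical step values implied by the model being trained.
predicted = [0.5, 0.9, 0.25]
loss = mse_loss(predicted, targets)
print(targets, loss)
```

In an actual training setup the predicted values would come from the language model and the loss would be minimized by gradient descent; this sketch only shows how step-level targets and the regression loss fit together.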

Anthology ID: 2025.emnlp-main.668
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 13214–13227
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.668/
Cite (ACL): Hongbo Zhang, Han Cui, Guangsheng Bao, Linyi Yang, Jun Wang, and Yue Zhang. 2025. Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13214–13227, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values (Zhang et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.668.pdf
Checklist: 2025.emnlp-main.668.checklist.pdf