Yuki Ichihara
2025
Theoretical Guarantees for Minimum Bayes Risk Decoding
Yuki Ichihara | Yuu Jinnai | Kaito Ariu | Tetsuro Morimura | Eiji Uchibe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Minimum Bayes Risk (MBR) decoding selects an output by maximizing the expected utility under an underlying human distribution. While prior work has shown the effectiveness of MBR decoding through empirical evaluation, few studies have analytically investigated why the method is effective. Our analysis shows that, given the size n of the reference hypothesis set used in the computation, MBR decoding approaches the optimal solution with high probability at a rate of $\mathcal{O}(n^{-1/2})$ under certain assumptions, even though the language space $\mathcal{Y}$ is significantly larger ($|\mathcal{Y}| \gg n$). This result helps to theoretically explain the strong performance observed in several prior empirical studies on MBR decoding. In addition, we derive the performance gap for maximum-a-posteriori (MAP) decoding and compare it to that of MBR decoding. Our results indicate that MBR decoding tends to converge to the optimal solution faster than MAP decoding in several cases.
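As a minimal sketch of the estimator analyzed in this paper (not the authors' implementation; the candidate set, the sampled pseudo-references, and the unigram-overlap utility below are illustrative assumptions), MBR decoding scores each candidate by its average utility against the n sampled reference hypotheses and returns the highest-scoring one:

```python
import math

def mbr_decode(candidates, references, utility):
    """Pick the candidate maximizing the Monte Carlo estimate of expected utility.

    candidates: list of candidate output strings
    references: list of n hypotheses sampled from the model, standing in for
                draws from the underlying human distribution
    utility:    callable u(candidate, reference) -> float (e.g. BLEU, COMET)
    """
    n = len(references)
    best, best_score = None, -math.inf
    for y in candidates:
        # \hat{U}(y) = (1/n) * sum_i u(y, y_i); the paper's analysis shows this
        # estimate concentrates around the true expectation at rate O(n^{-1/2}).
        score = sum(utility(y, r) for r in references) / n
        if score > best_score:
            best, best_score = score, y
    return best

# Toy usage with a placeholder unigram-overlap utility, for illustration only.
def overlap(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

cands = ["the cat sat", "a cat sat down", "dogs bark"]
refs = ["the cat sat down", "the cat is sitting", "a cat sat"]
print(mbr_decode(cands, refs, overlap))
```

The concentration of the Monte Carlo average around the true expected utility is what drives the $\mathcal{O}(n^{-1/2})$ rate stated above, even when $|\mathcal{Y}| \gg n$.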
Auto-Weighted Group Relative Policy Optimization for Multi-Objective Text Generation Tasks
Yuki Ichihara | Yuu Jinnai
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Group Relative Policy Optimization (GRPO) is a promising approach to complex, real-world tasks, such as those involving multiple rewards or strict constraints. However, when training GRPO with multiple rewards, the weight of each reward must be decided in advance. Failing to balance the objectives adequately can lead to overfitting or insufficient learning of individual reward functions. To address this problem, we propose Auto-Weighted Group Relative Policy Optimization (AW-GRPO), which adjusts the reward weights during training according to the learning progress of each objective so far. We evaluate AW-GRPO on advertising text generation, a real-world problem where the generated text must satisfy multiple objectives, such as quality and diversity, while adhering to the constraints of the medium (e.g., a maximum number of characters). Our results show that AW-GRPO successfully balances multiple objectives, improving overall scores while reducing the constraint violation rate. For reproducibility, we additionally evaluate AW-GRPO on publicly available benchmark problems, where we observe the same qualitative result: the proposed method outperforms GRPO.
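The abstract does not specify the weighting rule, so the sketch below is a hypothetical instance of the idea rather than the paper's method (the EMA-based progress measure and softmax re-weighting are assumptions): objectives whose recent progress is small receive larger weight, so slow-learning rewards are not drowned out.

```python
import numpy as np

class AutoWeighter:
    """Hypothetical auto-weighting of multiple rewards by learning progress.

    Tracks an exponential moving average (EMA) of each objective's reward and
    up-weights objectives that have improved the least since the last update.
    """
    def __init__(self, num_objectives, ema=0.9, temperature=1.0):
        self.avg = np.zeros(num_objectives)   # EMA of each objective's reward
        self.prev = np.zeros(num_objectives)  # previous EMA, to measure progress
        self.ema = ema
        self.temperature = temperature

    def update(self, batch_rewards):
        """batch_rewards: array of shape (batch_size, num_objectives)."""
        self.prev = self.avg.copy()
        self.avg = self.ema * self.avg + (1 - self.ema) * batch_rewards.mean(axis=0)
        return self.weights()

    def weights(self):
        # Smaller recent progress -> larger weight (softmax over negated progress).
        progress = self.avg - self.prev
        w = np.exp(-progress / self.temperature)
        return w / w.sum()

# Illustrative usage: scalarize per-objective rewards before computing
# group-relative advantages in GRPO.
aw = AutoWeighter(num_objectives=3)
rewards = np.array([[0.8, 0.2, 0.5],   # e.g. quality, diversity, constraint
                    [0.7, 0.3, 0.6]])
w = aw.update(rewards)
scalar_rewards = rewards @ w  # one weighted reward per sample
```

The weighted sum is one simple way to feed multiple objectives into GRPO's single-reward advantage computation; the actual balancing rule and constraint handling in AW-GRPO may differ.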