Radha Gulhane


2026

Policy optimization in reinforcement learning has traditionally relied on a single reward mechanism, most commonly a model-based reward. Such monolithic rewards often lack confidence calibration across domain-specific tasks and fail to capture the diverse aspects of a model's responses. Relying on a learned reward model alone also requires extensive data annotation and reward model training, which is particularly challenging for multimodal models. In this work, we propose and thoroughly study hybrid and multi-aspect reward modeling. For accuracy and confidence calibration, we introduce a hybrid reward modeling framework that integrates two complementary reward paradigms: model-based rewards, in which a learned reward model predicts scalar or vector scores, and rule-based rewards, in which domain-specific heuristics provide explicit correctness signals with confidence. Beyond accuracy, we incorporate multi-aspect rewards to enforce instruction adherence and introduce a generalized length-penalty reward to stabilize training and improve performance. Experiments demonstrate that this approach significantly enhances reasoning capabilities: our best-performing 3B model achieves an average improvement of ~9.5% across multimodal benchmarks, with a notable ~16% gain on mathematical reasoning tasks.
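
To make the combination of reward signals concrete, the following is a minimal Python sketch of one way a hybrid, multi-aspect reward could be assembled. It is not the formulation used in this work. The gating rule (use the rule-based score when its confidence is high, otherwise fall back to the model-based score), the "Answer:" format convention, the linear length penalty, and all names, weights, and thresholds are hypothetical assumptions chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RewardConfig:
    # All values below are illustrative assumptions, not values from the paper.
    w_accuracy: float = 1.0
    w_format: float = 0.5
    w_length: float = 0.5
    confidence_threshold: float = 0.8
    max_tokens: int = 256


def rule_based_reward(response: str, reference: str) -> Tuple[float, float]:
    """Hypothetical rule: compare the last 'Answer: ...' line to the reference.

    Returns (score, confidence). Confidence is 0.0 when no answer can be
    parsed, signalling that the rule cannot judge this response.
    """
    matches = re.findall(r"Answer:\s*(.+)", response)
    if not matches:
        return 0.0, 0.0
    predicted = matches[-1].strip()
    return (1.0 if predicted == reference.strip() else 0.0), 1.0


def format_reward(response: str) -> float:
    """Assumed instruction-adherence check: the response must state 'Answer:'."""
    return 1.0 if "Answer:" in response else 0.0


def length_penalty(response: str, max_tokens: int) -> float:
    """Assumed linear penalty in [-1, 0] on whitespace tokens beyond max_tokens."""
    n_tokens = len(response.split())
    if n_tokens <= max_tokens:
        return 0.0
    return -min(1.0, (n_tokens - max_tokens) / max_tokens)


def hybrid_reward(
    response: str,
    reference: str,
    model_score_fn: Callable[[str], float],
    cfg: RewardConfig,
) -> float:
    """Combine accuracy (rule-based or model-based), format, and length terms."""
    rule_score, confidence = rule_based_reward(response, reference)
    if confidence >= cfg.confidence_threshold:
        accuracy = rule_score  # trust the explicit correctness signal
    else:
        accuracy = model_score_fn(response)  # fall back to the learned reward model
    return (
        cfg.w_accuracy * accuracy
        + cfg.w_format * format_reward(response)
        + cfg.w_length * length_penalty(response, cfg.max_tokens)
    )


if __name__ == "__main__":
    # Stand-in for a learned reward model; a real one would score the response.
    def dummy_model(response: str) -> float:
        return 0.3

    cfg = RewardConfig()
    print(hybrid_reward("Some reasoning.\nAnswer: 42", "42", dummy_model, cfg))
    print(hybrid_reward("I think it is forty-two", "42", dummy_model, cfg))
```

In this sketch the rule-based path takes precedence whenever it can parse an answer, so the learned reward model is consulted only for responses the heuristic cannot judge. Other gating or weighting schemes are equally plausible.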