Verifying the Subjective: Structured Multilingual Rewards for Low-Resource Alignment

Jiu Sha, Mengxiao Zhu


Abstract
Aligning LLMs in low-resource multilingual settings faces a fundamental reward bottleneck: scalar rewards generalize poorly across cultures, while unstructured critiques remain noisy and unverifiable. To bridge this gap, we introduce a Structured Multilingual Reward Modeling Framework that extends Reinforcement Learning with Verifiable Rewards (RLVR) to subjective and open-ended tasks. The framework unifies three core components to transform abstract quality into concrete supervision: (1) a Structured Checklist Schema that decomposes evaluation into granular universal reasoning steps and task-specific criteria; (2) Structured Generative Critique Modeling, which produces rubric-aligned critiques with grounded justifications; and (3) Adaptive Multilingual Reward Optimization, which integrates reasoning quality and language consistency into a verifiable objective. We integrate this framework into a bootstrapped Group Relative Policy Optimization pipeline, augmented by length-aware normalization and variance stabilization to keep training stable. Extensive experiments on a newly constructed suite covering 7 subjective task categories across 50 low-resource languages demonstrate that this checklist-driven approach yields substantial improvements in reasoning capability and response quality, particularly in settings where traditional reward models degrade significantly. We publicly release our models and the corresponding evaluation benchmark to facilitate further research. Our code is available at https://github.com/Shajiu/SGCM.
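To make the checklist-to-reward idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each checklist item resolves to a pass/fail outcome, that the reward is the weighted pass rate, and that a language-consistency failure scales the reward down by a fixed factor. The function names, the checklist item names, the `lang_penalty` factor, and the `eps` term are all hypothetical. The group-relative step is the standard within-group standardization used in Group Relative Policy Optimization; the length-aware normalization and variance stabilization mentioned in the abstract are not reproduced here.

```python
from statistics import mean, pstdev


def checklist_reward(checks, weights=None, language_ok=True, lang_penalty=0.5):
    """Aggregate pass/fail checklist outcomes into a scalar reward in [0, 1].

    `checks` maps a (hypothetical) checklist item name to True/False.
    `lang_penalty` is an assumed down-weighting applied when the response
    is not in the expected language; the paper's actual rule may differ.
    """
    if not checks:
        raise ValueError("checklist must contain at least one item")
    weights = weights or {name: 1.0 for name in checks}
    total = sum(weights[name] for name in checks)
    passed = sum(weights[name] for name, ok in checks.items() if ok)
    score = passed / total
    return score if language_ok else score * lang_penalty


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    `eps` only guards against division by zero when all rewards are equal;
    it is not the paper's variance-stabilization scheme.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to one prompt, scored against the same checklist.
group = [
    checklist_reward({"addresses_prompt": True, "justified": True, "tone": True}),
    checklist_reward({"addresses_prompt": True, "justified": False, "tone": True}),
    checklist_reward(
        {"addresses_prompt": True, "justified": True, "tone": True},
        language_ok=False,
    ),
]
advantages = group_relative_advantages(group)
```

In this sketch the response that passes every item and stays in the target language receives a positive advantage, while the other two fall below the group mean and receive negative advantages.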
Anthology ID:
2026.findings-acl.1174
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
23442–23474
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1174/
Cite (ACL):
Jiu Sha and Mengxiao Zhu. 2026. Verifying the Subjective: Structured Multilingual Rewards for Low-Resource Alignment. In Findings of the Association for Computational Linguistics: ACL 2026, pages 23442–23474, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Verifying the Subjective: Structured Multilingual Rewards for Low-Resource Alignment (Sha & Zhu, Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1174.pdf
Checklist:
2026.findings-acl.1174.checklist.pdf