MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization

Yang Zhao, Hepeng Wang, Xiao Ding, Yangou Ouyang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, Ting Liu


Abstract
Group-Relative Policy Optimization (GRPO) has emerged as an efficient paradigm for aligning Large Language Models (LLMs), yet its efficacy is primarily confined to domains with verifiable ground truths. Extending GRPO to **open-domain settings** remains a critical challenge, as **unconstrained generation** entails multi-faceted and often conflicting objectives—such as creativity versus factuality—where rigid, static reward scalarization is inherently suboptimal. To address this, we propose **MAESTRO** (**M**eta-learning **A**daptive **E**stimation of **S**calarization **T**rade-offs for **R**eward **O**ptimization), which introduces a meta-cognitive orchestration layer that treats reward scalarization as a dynamic latent policy, leveraging the model’s terminal hidden states as a semantic bottleneck to perceive task-specific priorities. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal. Across seven benchmarks, MAESTRO consistently outperforms single-reward and static multi-objective baselines, while preserving the efficiency advantages of GRPO, and in some settings even reducing redundant generation.
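The core idea of the abstract, context-dependent reward scalarization followed by group-relative normalization, can be illustrated with a minimal sketch. This is not the authors' implementation: the linear `projection` head standing in for the Conductor, the toy hidden state, and the two reward dimensions are all assumptions made for illustration, and the meta-update that trains the Conductor is omitted.

```python
import math
from statistics import mean, pstdev


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conductor_weights(hidden_state, projection):
    """Map a terminal hidden state to scalarization weights.

    Hypothetical stand-in for the Conductor: a single linear head with
    one row of `projection` per reward objective, followed by a softmax.
    """
    logits = [sum(w * h for w, h in zip(row, hidden_state)) for row in projection]
    return softmax(logits)


def scalarize(reward_vectors, weights):
    """Collapse each multi-objective reward vector into one scalar."""
    return [sum(w * r for w, r in zip(weights, rv)) for rv in reward_vectors]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy inputs (assumed values): one prompt, four sampled responses,
# two reward objectives per response (e.g. creativity, factuality).
hidden_state = [0.5, -1.0, 2.0]
projection = [[0.2, 0.0, 0.1], [0.0, 0.3, -0.1]]
group_rewards = [[0.9, 0.2], [0.4, 0.8], [0.6, 0.6], [0.1, 0.9]]

weights = conductor_weights(hidden_state, projection)
scalar_rewards = scalarize(group_rewards, weights)
advantages = group_relative_advantages(scalar_rewards)
```

With these toy values the first objective receives the larger weight, so the response that scores highest on it ends up with the largest advantage in the group. A static scalarization would fix `weights` once for all prompts; the abstract's proposal is to make them depend on the prompt through the hidden state.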
Anthology ID:
2026.acl-long.1019
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
22267–22283
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1019/
Cite (ACL):
Yang Zhao, Hepeng Wang, Xiao Ding, Yangou Ouyang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, and Ting Liu. 2026. MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22267–22283, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization (Zhao et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1019.pdf
Checklist:
2026.acl-long.1019.checklist.pdf