Reward Generalization in RLHF: A Topological Perspective

Tianyi Alex Qiu, Fanzhi Zeng, Jiaming Ji, Dong Yan, Kaile Wang, Jiayi Zhou, Yang Han, Josef Dai, Xuehai Pan, Yaodong Yang


Abstract
Existing alignment methods share a common topology of information flow, where reward information is collected from humans, modeled with preference learning, and used to tune language models. However, this shared topology has not been systematically characterized, nor have its alternatives been thoroughly explored, leaving the problems of low data efficiency and unreliable generalization unaddressed. As a solution, we introduce a theory of **reward generalization** in reinforcement learning from human feedback (RLHF), focusing on the **topology of information flow** at both macro and micro levels. At the macro level, we portray the RLHF information flow as an autoencoding process over behavior distributions, formalizing the RLHF objective of distributional consistency between human preference and model behavior. At the micro level, we present *induced Bayesian networks* to model the impact of dataset topologies on reward generalization. Combining analysis on both levels, we propose **reward modeling from tree-structured preference information**. It is shown to reduce reward uncertainty by up to Θ(log n / log log n) times compared to baselines, where n is the dataset size. Validation on three NLP tasks shows that it achieves an average win rate of 65% against baselines, thus improving reward generalization *for free* via topology design, while *reducing* the amount of data requiring annotation.
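To make the topology contrast in the abstract concrete, below is a minimal, hypothetical Python sketch (not the authors' implementation). Under the same annotation budget, a chain-style baseline collects independent response pairs per prompt, whereas a tree-style dataset compares sibling continuations that share a prefix; the shared prefixes correlate the comparisons, which is the structural difference the paper's analysis exploits. The names `chain_pairs`, `tree_pairs`, and `fake_token` are illustrative assumptions, not identifiers from the paper.

```python
# Illustrative sketch only: chain-structured vs. tree-structured preference data.
import itertools
import random

random.seed(0)

def chain_pairs(prompts, sample_response, pairs_per_prompt=1):
    """Chain-style (baseline) topology: independent response pairs per prompt."""
    data = []
    for p in prompts:
        for _ in range(pairs_per_prompt):
            a, b = sample_response(p), sample_response(p)
            data.append((p, a, b))  # each pair shares nothing but the prompt
    return data

def tree_pairs(prompts, sample_continuation, branching=2, depth=3):
    """Tree-style topology: responses share prefixes, and preference comparisons
    are annotated between sibling continuations at each internal node."""
    data = []
    for p in prompts:
        frontier = [p]  # partial responses (tree nodes)
        for _ in range(depth):
            next_frontier = []
            for prefix in frontier:
                children = [prefix + sample_continuation(prefix) for _ in range(branching)]
                # compare siblings that share `prefix`
                data.extend((p, a, b) for a, b in itertools.combinations(children, 2))
                next_frontier.extend(children)
            frontier = next_frontier
    return data

# Toy usage with a dummy sampler: both topologies use 14 annotated comparisons,
# but the tree's comparisons are correlated through shared prefixes.
prompts = ["Q1: ", "Q2: "]
fake_token = lambda _ctx: random.choice("abcd")
print(len(chain_pairs(prompts, fake_token, pairs_per_prompt=7)))   # 14 independent pairs
print(len(tree_pairs(prompts, fake_token, branching=2, depth=3)))  # 14 correlated pairs
```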
Anthology ID:
2025.findings-acl.820
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
15884–15930
URL:
https://preview.aclanthology.org/landing_page/2025.findings-acl.820/
Cite (ACL):
Tianyi Alex Qiu, Fanzhi Zeng, Jiaming Ji, Dong Yan, Kaile Wang, Jiayi Zhou, Yang Han, Josef Dai, Xuehai Pan, and Yaodong Yang. 2025. Reward Generalization in RLHF: A Topological Perspective. In Findings of the Association for Computational Linguistics: ACL 2025, pages 15884–15930, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Reward Generalization in RLHF: A Topological Perspective (Qiu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.findings-acl.820.pdf