Revisiting Weak-to-Strong Generalization in Theory and Practice: Reverse KL vs. Forward KL

Wei Yao, Wenkai Yang, Ziqiao Wang, Yankai Lin, Yong Liu


Abstract
As large language models advance toward superhuman performance, ensuring their alignment with human values and abilities grows increasingly complex. Weak-to-strong generalization offers a promising approach by leveraging predictions from weaker models to guide stronger systems, but its effectiveness could be constrained by the inherent noise and inaccuracies in these weak predictions. To address this, we propose a theoretically grounded approach that replaces forward KL divergence—whose mass-covering behavior risks overfitting to imperfect weak signals—with reverse KL divergence. Reverse KL divergence’s zero-forcing effect prioritizes high-confidence predictions, effectively mitigating the influence of unreliable weak supervision. Theoretically, we extend existing bounds and derive tighter lower bounds for both forward and reverse KL divergence. Notably, when a sufficiently pre-trained strong model is fine-tuned on the last linear layer, reverse KL guarantees that it outperforms its weak supervisor by the magnitude of their disagreement. Empirically, we demonstrate that reverse KL and reverse cross-entropy not only enable strong models to outperform those trained with forward KL and standard cross-entropy across most settings, but also exhibit greater robustness to noisy labels.
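The mass-covering vs. zero-forcing contrast the abstract describes can be made concrete with a small numerical sketch (illustrative only; the distributions, variable names, and the `kl` helper below are hypothetical, not taken from the paper): when a weak teacher places zero mass on a class, reverse KL sharply penalizes any student mass there, while forward KL barely notices.

```python
import numpy as np

def kl(a, b, eps=1e-12):
    """KL(a || b) for discrete distributions, clipped to avoid log(0)."""
    a = np.clip(a, eps, 1.0)
    b = np.clip(b, eps, 1.0)
    return float(np.sum(a * np.log(a / b)))

# Hypothetical weak-teacher distribution over three classes;
# it places no mass on class 2.
p_weak = np.array([0.7, 0.3, 0.0])

# Two candidate strong-student fits.
q_mode  = np.array([0.9, 0.1, 0.0])   # mode-seeking: sharpens the confident class
q_cover = np.array([0.6, 0.3, 0.1])   # mass-covering: spills mass onto class 2

# Forward KL(p || q) barely distinguishes the two students, while
# reverse KL(q || p) blows up for q_cover: the student is heavily
# penalized for placing mass where the teacher has none -- the
# "zero-forcing" behavior the abstract attributes to reverse KL.
print("forward KL:", kl(p_weak, q_mode), kl(p_weak, q_cover))
print("reverse KL:", kl(q_mode, p_weak), kl(q_cover, p_weak))
```

Under noisy weak supervision, this asymmetry is why the authors argue reverse KL concentrates the student on the teacher's high-confidence predictions instead of imitating its unreliable tail.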
Anthology ID:
2025.findings-acl.148
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2860–2888
URL:
https://preview.aclanthology.org/landing_page/2025.findings-acl.148/
Cite (ACL):
Wei Yao, Wenkai Yang, Ziqiao Wang, Yankai Lin, and Yong Liu. 2025. Revisiting Weak-to-Strong Generalization in Theory and Practice: Reverse KL vs. Forward KL. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2860–2888, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Revisiting Weak-to-Strong Generalization in Theory and Practice: Reverse KL vs. Forward KL (Yao et al., Findings 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.findings-acl.148.pdf