How to Mitigate Overfitting in Weak-to-strong Generalization?

Junhao Shi; Qinyuan Cheng; Zhaoye Fei; Yining Zheng; Qipeng Guo; Xipeng Qiu (邱锡鹏)

How to Mitigate Overfitting in Weak-to-strong Generalization?

Junhao Shi, Qinyuan Cheng, Zhaoye Fei, Yining Zheng, Qipeng Guo, Xipeng Qiu

Abstract

Aligning powerful AI models on tasks that surpass human evaluation capabilities is the central problem of **superalignment**. To address this problem, weak-to-strong generalization aims to elicit the capabilities of strong models through weak supervisors and ensure that the behavior of strong models aligns with the intentions of weak supervisors without unsafe behaviors such as deception. Although weak-to-strong generalization exhibiting certain generalization capabilities, strong models exhibit significant overfitting in weak-to-strong generalization: Due to the strong fit ability of strong models, erroneous labels from weak supervisors may lead to overfitting in strong models. In addition, simply filtering out incorrect labels may lead to a degeneration in question quality, resulting in a weak generalization ability of strong models on hard questions. To mitigate overfitting in weak-to-strong generalization, we propose a two-stage framework that simultaneously improves the quality of supervision signals and the quality of input questions. Experimental results in three series of large language models and two mathematical benchmarks demonstrate that our framework significantly improves PGR (Performance Gap Recovered) compared to naive weak-to-strong generalization, even achieving up to 100% PGR on some models.

Anthology ID:: 2025.acl-long.784
Volume:: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2025
Address:: Vienna, Austria
Editors:: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 16100–16118
Language:
URL:: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.784/
DOI:
Bibkey:
Cite (ACL):: Junhao Shi, Qinyuan Cheng, Zhaoye Fei, Yining Zheng, Qipeng Guo, and Xipeng Qiu. 2025. How to Mitigate Overfitting in Weak-to-strong Generalization?. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16100–16118, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):: How to Mitigate Overfitting in Weak-to-strong Generalization? (Shi et al., ACL 2025)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.784.pdf

PDF Cite Search Fix data