Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models

Lillian Sun, Martin Pawelczyk, Zhenting Qi, Aounon Kumar, Himabindu Lakkaraju


Abstract
As large language models continue to advance, ensuring their trustworthiness is critical. However, inaccessible real-world ground truth labels pose a significant challenge in high-stakes domains. Recent studies have highlighted weak-to-strong generalization, where a strong model trained only on a weak model’s labels surpasses the weak model in task performance. Yet, whether critical trustworthiness properties such as robustness, fairness, and privacy can generalize similarly remains an open question. This is the first work to study this question by examining if a stronger model can enhance trustworthiness when fine-tuned on a weaker model’s labels, a paradigm we term weak-to-strong trustworthiness. To address this, we introduce two fundamental fine-tuning strategies that leverage trustworthiness regularization during the fine-tuning of the weak model and the weak-to-strong transfer. Our experimental evaluation on real-world datasets reveals that while some trustworthiness properties, such as fairness, adversarial robustness, and OOD robustness, show significant improvement in trustworthiness generalization when both models were regularized, others, like privacy, do not exhibit signs of weak-to-strong trustworthiness. Our results highlight the potential of weak-to-strong trustworthiness as a practical pathway for enhancing the trustworthiness of increasingly capable AI systems, even under imperfect real-world conditions.
Anthology ID:
2026.acl-long.2163
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
46625–46647
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2163/
DOI:
Bibkey:
Cite (ACL):
Lillian Sun, Martin Pawelczyk, Zhenting Qi, Aounon Kumar, and Himabindu Lakkaraju. 2026. Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 46625–46647, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models (Sun et al., ACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2163.pdf
Checklist:
 2026.acl-long.2163.checklist.pdf