Can we trust the evaluation on ChatGPT?

Rachith Aiyappa, Jisun An, Haewoon Kwak, Yong-Yeol Ahn


Abstract
ChatGPT, the first large language model with mass adoption, has demonstrated remarkable performance in numerous natural language tasks. Despite its evident usefulness, evaluating ChatGPT's performance in diverse problem domains remains challenging due to the closed nature of the model and its continuous updates via Reinforcement Learning from Human Feedback (RLHF). We highlight the issue of data contamination in ChatGPT evaluations, with a case study in stance detection. We discuss the challenge of preventing data contamination and ensuring fair model evaluation in the age of closed and continuously trained models.
Anthology ID:
2023.trustnlp-1.5
Volume:
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anaelia Ovalle, Kai-Wei Chang, Ninareh Mehrabi, Yada Pruksachatkun, Aram Galstyan, Jwala Dhamala, Apurv Verma, Trista Cao, Anoop Kumar, Rahul Gupta
Venue:
TrustNLP
Publisher:
Association for Computational Linguistics
Pages:
47–54
URL:
https://aclanthology.org/2023.trustnlp-1.5
DOI:
10.18653/v1/2023.trustnlp-1.5
Cite (ACL):
Rachith Aiyappa, Jisun An, Haewoon Kwak, and Yong-Yeol Ahn. 2023. Can we trust the evaluation on ChatGPT?. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pages 47–54, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Can we trust the evaluation on ChatGPT? (Aiyappa et al., TrustNLP 2023)
PDF:
https://preview.aclanthology.org/nschneid-patch-3/2023.trustnlp-1.5.pdf
Video:
https://preview.aclanthology.org/nschneid-patch-3/2023.trustnlp-1.5.mp4