Follow the leader(board) with confidence: Estimating p-values from a single test set with item and response variance

Shira Wein, Christopher Homan, Lora Aroyo, Chris Welty


Abstract
Among the problems with leaderboard culture in NLP has been the widespread lack of confidence estimation in reported results. In this work, we present a framework and simulator for estimating p-values for comparisons between the results of two systems, in order to understand the confidence that one is actually better (i.e. ranked higher) than the other. What has made this difficult in the past is that each system must itself be evaluated by comparison to a gold standard. We define a null hypothesis that each system’s metric scores are drawn from the same distribution, using variance found naturally (though rarely reported) in test set items and individual labels on an item (responses) to produce the metric distributions. We create a test set that evenly mixes the responses of the two systems under the assumption the null hypothesis is true. Exploring how to best estimate the true p-value from a single test set under different metrics, tests, and sampling methods, we find that the presence of response variance (from multiple raters or multiple model versions) has a profound impact on p-value estimates for model comparison, and that choice of metric and sampling method is critical to providing statistical guarantees on model comparisons.
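To make the idea in the abstract concrete, the following is a minimal sketch of a simulation-based p-value estimate. It is not the authors' simulator. It assumes a categorical task scored by accuracy, a one-sided test (system A better than system B), item variance modeled by resampling items with replacement, and response variance modeled by drawing one rater response per item. The null hypothesis is approximated by drawing each pseudo-system's prediction from the pooled predictions of both systems. All function and parameter names are hypothetical.

```python
import random


def soft_accuracy(preds, gold_responses):
    """Mean fraction of rater responses that agree with each prediction."""
    per_item = [
        sum(r == p for r in responses) / len(responses)
        for p, responses in zip(preds, gold_responses)
    ]
    return sum(per_item) / len(per_item)


def estimate_p_value(preds_a, preds_b, gold_responses, n_trials=2000, seed=0):
    """One-sided p-value estimate for 'A scores higher than B' (illustrative only)."""
    rng = random.Random(seed)
    n = len(gold_responses)
    observed = soft_accuracy(preds_a, gold_responses) - soft_accuracy(
        preds_b, gold_responses
    )

    at_least_as_extreme = 0
    for _ in range(n_trials):
        # Item variance (assumption): resample items with replacement.
        items = [rng.randrange(n) for _ in range(n)]
        correct_a = correct_b = 0
        for i in items:
            # Response variance (assumption): one rater response per item.
            gold = rng.choice(gold_responses[i])
            # Null hypothesis: both pseudo-systems draw from the same
            # evenly mixed pool of the two systems' predictions.
            pool = (preds_a[i], preds_b[i])
            correct_a += rng.choice(pool) == gold
            correct_b += rng.choice(pool) == gold
        if (correct_a - correct_b) / n >= observed:
            at_least_as_extreme += 1

    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_trials + 1)
```

In this sketch, two systems with identical predictions yield an estimate of 1.0, while a large observed gap on many items yields an estimate near 1 / (n_trials + 1). Other metrics or sampling schemes would change the estimate, which is the sensitivity the abstract describes.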
Anthology ID:
2023.findings-acl.196
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3138–3161
URL:
https://aclanthology.org/2023.findings-acl.196
DOI:
10.18653/v1/2023.findings-acl.196
Cite (ACL):
Shira Wein, Christopher Homan, Lora Aroyo, and Chris Welty. 2023. Follow the leader(board) with confidence: Estimating p-values from a single test set with item and response variance. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3138–3161, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Follow the leader(board) with confidence: Estimating p-values from a single test set with item and response variance (Wein et al., Findings 2023)
PDF:
https://preview.aclanthology.org/nschneid-patch-1/2023.findings-acl.196.pdf