Bayes Test of Precision, Recall, and F1 Measure for Comparison of Two Natural Language Processing Models

Ruibo Wang, Jihong Li


Abstract
Directly comparing point estimates of the precision (P), recall (R), and F1 measure of two natural language processing (NLP) models on a common test corpus is unreasonable and leads to less replicable conclusions because no statistical test is involved. However, the existing t-tests in cross-validation (CV) for model comparison are inappropriate because the distributions of P, R, and F1 are skewed, and an interval estimate of P, R, or F1 based on a t-test may extend beyond [0,1]. In this study, we propose using a block-regularized 3×2 CV (3×2 BCV) in model comparison because it can regularize the difference in certain frequency distributions over linguistic units between training and validation sets and yield stable estimators of P, R, and F1. On the basis of the 3×2 BCV, we calibrate the posterior distributions of P, R, and F1 and derive accurate interval estimates of them. Furthermore, we formulate the comparison as a hypothesis testing problem and propose a novel Bayes test. The test directly computes the probabilities of the hypotheses on the basis of the posterior distributions and provides more informative decisions than the existing significance t-tests. Three experiments on NLP chunking tasks are conducted, and the results illustrate the validity of the Bayes test.
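To make the idea of a posterior-based comparison concrete, the following is a minimal sketch and not the paper's procedure. It assumes a simple Dirichlet posterior over the (TP, FP, FN) proportions of a single test set with a uniform prior, treats the two models' posteriors as independent, and ignores the 3×2 BCV calibration described in the abstract. The function names (`sample_prf`, `prob_f1_greater`, `credible_interval`) and all counts are hypothetical.

```python
import random


def sample_prf(tp, fp, fn, n_draws=20000, prior=1.0, seed=0):
    """Draw (P, R, F1) triples from an assumed Dirichlet posterior.

    Normalised Gamma variates give a Dirichlet draw over the TP, FP, and
    FN proportions, so every sampled P, R, and F1 stays inside [0, 1].
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        u = rng.gammavariate(tp + prior, 1.0)  # weight for true positives
        v = rng.gammavariate(fp + prior, 1.0)  # weight for false positives
        w = rng.gammavariate(fn + prior, 1.0)  # weight for false negatives
        precision = u / (u + v)
        recall = u / (u + w)
        f1 = 2 * u / (2 * u + v + w)
        draws.append((precision, recall, f1))
    return draws


def prob_f1_greater(counts_a, counts_b, n_draws=20000, seed=0):
    """Monte Carlo estimate of the posterior probability that F1_A > F1_B.

    Assumes the two posteriors are independent, which is a simplification
    when both models are scored on the same corpus.
    """
    a = sample_prf(*counts_a, n_draws=n_draws, seed=seed)
    b = sample_prf(*counts_b, n_draws=n_draws, seed=seed + 1)
    wins = sum(1 for (_, _, fa), (_, _, fb) in zip(a, b) if fa > fb)
    return wins / n_draws


def credible_interval(values, level=0.95):
    """Equal-tailed credible interval read off the sorted samples."""
    ordered = sorted(values)
    lo = int((1 - level) / 2 * len(ordered))
    hi = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lo], ordered[hi]


if __name__ == "__main__":
    # Hypothetical (TP, FP, FN) counts for two chunkers on one test set.
    model_a = (850, 120, 150)
    model_b = (830, 140, 170)
    f1_a = [f for _, _, f in sample_prf(*model_a)]
    print("95% interval for F1 of A:", credible_interval(f1_a))
    print("P(F1_A > F1_B):", prob_f1_greater(model_a, model_b))
```

Unlike a t-based interval, the interval obtained this way cannot leave [0,1], and the comparison is reported as a probability of a hypothesis rather than as a p-value.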
Anthology ID:
P19-1405
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
4135–4145
URL:
https://aclanthology.org/P19-1405
DOI:
10.18653/v1/P19-1405
Cite (ACL):
Ruibo Wang and Jihong Li. 2019. Bayes Test of Precision, Recall, and F1 Measure for Comparison of Two Natural Language Processing Models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4135–4145, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Bayes Test of Precision, Recall, and F1 Measure for Comparison of Two Natural Language Processing Models (Wang & Li, ACL 2019)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/P19-1405.pdf