Understanding Deep Learning Performance through an Examination of Test Set Difficulty: A Psychometric Case Study

John P. Lalor, Hao Wu, Tsendsuren Munkhdalai, Hong Yu


Abstract
Interpreting the performance of deep learning models beyond test set accuracy is challenging. Characteristics of individual data points are often not considered during evaluation, and each data point is treated equally. In this work we examine the impact of a test set question's difficulty on model performance, to determine whether there is a relationship between difficulty and performance. We model difficulty using well-studied psychometric methods applied to human response patterns. Experiments on Natural Language Inference (NLI) and Sentiment Analysis (SA) show that the likelihood of answering a question correctly is affected by the question's difficulty. In addition, as DNNs are trained on larger datasets, easy questions begin to have a higher probability of being answered correctly than harder ones.
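The abstract describes estimating question difficulty from human response patterns with psychometric methods. One standard psychometric approach is Item Response Theory (IRT); the sketch below fits a two-parameter logistic (2PL) IRT model to a synthetic binary response matrix by joint maximum likelihood. The 2PL model choice, the synthetic data, and the gradient-ascent fit are illustrative assumptions, not the paper's exact estimation procedure.

```python
# A minimal sketch, assuming a two-parameter logistic (2PL) IRT model: item
# difficulty b_i and discrimination a_i are fit jointly with respondent
# ability theta_j by maximum likelihood on a binary response matrix
# (rows = respondents, columns = items). Synthetic data and hyperparameters
# are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic human response patterns: 500 respondents answer 40 items.
n_resp, n_items = 500, 40
true_theta = rng.normal(0.0, 1.0, n_resp)       # respondent ability
true_a = rng.uniform(0.5, 2.0, n_items)         # item discrimination
true_b = rng.normal(0.0, 1.0, n_items)          # item difficulty
true_p = 1.0 / (1.0 + np.exp(-true_a * (true_theta[:, None] - true_b)))
responses = (rng.random((n_resp, n_items)) < true_p).astype(float)

# Parameters to estimate.
theta = np.zeros(n_resp)
a = np.ones(n_items)
b = np.zeros(n_items)

lr = 0.1
for _ in range(3000):
    z = a * (theta[:, None] - b)                # logits, shape (n_resp, n_items)
    p = 1.0 / (1.0 + np.exp(-z))                # predicted P(correct)
    err = responses - p                         # d(log-likelihood)/d(logit)
    # Gradient ascent on the Bernoulli log-likelihood.
    theta += lr * (err * a).mean(axis=1)
    a += lr * (err * (theta[:, None] - b)).mean(axis=0)
    b += lr * (-err * a).mean(axis=0)
    # Anchor the latent scale for identifiability: ability has mean 0, std 1.
    theta = (theta - theta.mean()) / theta.std()

# Recovered difficulties should correlate strongly with the generating ones.
print("difficulty correlation:", round(float(np.corrcoef(b, true_b)[0, 1]), 3))
```

Under such a model, a question's estimated difficulty b can then be compared against a trained model's probability of answering it correctly, which is the kind of difficulty-versus-performance analysis the abstract reports.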
Anthology ID:
D18-1500
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
4711–4716
URL:
https://aclanthology.org/D18-1500
DOI:
10.18653/v1/D18-1500
Cite (ACL):
John P. Lalor, Hao Wu, Tsendsuren Munkhdalai, and Hong Yu. 2018. Understanding Deep Learning Performance through an Examination of Test Set Difficulty: A Psychometric Case Study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4711–4716, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Understanding Deep Learning Performance through an Examination of Test Set Difficulty: A Psychometric Case Study (Lalor et al., EMNLP 2018)
PDF:
https://preview.aclanthology.org/ingestion-script-update/D18-1500.pdf
Attachment:
 D18-1500.Attachment.zip
Video:
 https://vimeo.com/306154181
Data
SNLI