Abstract
Interpreting the performance of deep learning models beyond test set accuracy is challenging. Characteristics of individual data points are often not considered during evaluation, and each data point is treated equally. In this work we examine the impact of a test set question’s difficulty to determine if there is a relationship between difficulty and performance. We model difficulty using well-studied psychometric methods on human response patterns. Experiments on Natural Language Inference (NLI) and Sentiment Analysis (SA) show that the likelihood of answering a question correctly is impacted by the question’s difficulty. In addition, as DNNs are trained on larger datasets, easy questions start to have a higher probability of being answered correctly than harder questions.
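The abstract describes two ingredients: an item-difficulty estimate derived from human response patterns, and a comparison of that estimate against DNN correctness. The abstract does not name the specific psychometric model, but item response theory (IRT) is the standard tool for this kind of analysis. As a rough, hypothetical illustration (not the authors' implementation), the sketch below fits a Rasch-style one-parameter IRT model to a toy human response matrix and then correlates the estimated difficulties with a made-up vector of DNN correctness; the arrays `responses` and `dnn_correct` are placeholders.

```python
# Minimal sketch: estimate item difficulty from human response patterns with a
# Rasch-style (1-parameter IRT) model, then check whether estimated difficulty
# relates to a DNN's per-item correctness. Toy data only; not the authors' code.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
responses = (rng.random((50, 20)) < 0.7).astype(float)   # subjects x items, 0/1 human answers (toy)
dnn_correct = rng.integers(0, 2, size=20).astype(float)  # 0/1 DNN correctness per item (toy)

n_subj, n_items = responses.shape

def neg_log_lik(params):
    theta = params[:n_subj]          # subject ability
    b = params[n_subj:]              # item difficulty
    logits = theta[:, None] - b[None, :]
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-9
    nll = -np.sum(responses * np.log(p + eps) + (1 - responses) * np.log(1 - p + eps))
    return nll + 1e-3 * np.sum(params ** 2)  # small ridge term for identifiability

res = minimize(neg_log_lik, np.zeros(n_subj + n_items), method="L-BFGS-B")
difficulty = res.x[n_subj:]

# If the paper's finding holds on real data, higher difficulty should go with
# lower probability of the DNN answering correctly (negative correlation).
print(np.corrcoef(difficulty, dnn_correct)[0, 1])
```

On real data, one would replace the toy arrays with crowdsourced human responses and the trained model's per-item predictions, then repeat the comparison across training-set sizes as the abstract describes.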
- Anthology ID: D18-1500
- Volume: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
- Month: October-November
- Year: 2018
- Address: Brussels, Belgium
- Editors: Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
- Venue: EMNLP
- SIG: SIGDAT
- Publisher: Association for Computational Linguistics
- Pages: 4711–4716
- URL: https://aclanthology.org/D18-1500
- DOI: 10.18653/v1/D18-1500
- Cite (ACL): John P. Lalor, Hao Wu, Tsendsuren Munkhdalai, and Hong Yu. 2018. Understanding Deep Learning Performance through an Examination of Test Set Difficulty: A Psychometric Case Study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4711–4716, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal): Understanding Deep Learning Performance through an Examination of Test Set Difficulty: A Psychometric Case Study (Lalor et al., EMNLP 2018)
- PDF: https://preview.aclanthology.org/fix-dup-bibkey/D18-1500.pdf
- Data: SNLI